ForCES Working Group                                       J. Hadi Salim
Internet-Draft                                             Znyx Networks
Expires: April 25, 2004                                          R. Haas
                                                            IBM Research
                                                                S. Blake
                                                                Ericsson
                                                        October 26, 2003


                      Netlink2 as ForCES Protocol
                  draft-jhsrha-forces-netlink2-02.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 25, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   This document describes Netlink2, which is an extension of Linux
   Netlink [RFC3549]. This document is intended as a proposal for the
   ForCES IETF working group protocol. ForCES attempts to define a clear
   separation between the two entities of the NE in order to have them
   evolve separately as opposed to the current monolithic evolution.

Conventions used in this document

Hadi Salim, et al.       Expires April 25, 2004                 [Page 1]

Internet-Draft        Netlink2 as ForCES Protocol           October 2003

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Table of Contents

   1.     Introduction
   2.     Definitions
   3.     Netlink2 Overview
   4.     Summary of Netlink2 Modifications to Netlink
   4.1    Header Modifications
   4.2    Addressing and Transport Extensions
   5.     Netlink2 Message Format
   5.1    Netlink2 Message Header
   5.2    Type Length Value
   5.3    Encapsulated TLVs
   5.4    Netlink2-extension TLVs
   6.     Addressing and Transport Extensions
   6.1    Transport Methods
   6.1.1  Why Multicast?
   6.1.2  Why IP?
   6.1.3  Why UDP/TCP/SCTP/DCCP?
   6.2    The Netlink2 wire and bundle
   6.2.1  What wires go in a bundle?
   6.3    Redefining the Netlink PID Semantics
   6.4    Local Scope Addressing and Encapsulation
   6.5    Global Scope Addressing and Encapsulation
   7.     Protocol Architecture
   7.1    Protocol Phases
   7.1.1  The Pre-Association Phase
   7.1.2  The Association Phase
   7.1.3  Service Termination
   7.2    Protocol Logical Model
   7.3    Service Addressing
   7.4    Service Templates
   7.5    Mechanisms for Creating Protocols
   7.5.1  Building Reliable Protocols
   7.5.2  Building Availability
   7.5.3  The ACK Netlink2 Message
   7.5.4  Batching
   7.5.5  Atomicity and Ordering of Transactions
   8.     Putting together the base protocol for WG charter
   8.1    Netlink2-Extension TLVs
   8.1.1  Authentication
   8.1.2  Checksum
   8.1.3  Message Priority
   8.1.4  SYN COOKIE
   8.1.5  Name ID
   8.2    LFB and FE Attributes and discovery
   8.3    NE creation
   8.3.1  FE State transitions
   8.3.2  CE view of FE State transitions
   8.3.3  SYN Message Format
   8.3.4  FIN Message Format
   8.3.5  NOOP Message Format
   8.4    LFB and FE Service Templates
   8.4.1  Physical Port and Address Functions
   8.4.2  IPv4 and IPv6 L3 Forwarding Functions
   8.4.3  Filtering Functions
   8.4.4  QoS Functions
   8.4.5  IPSEC Functions
   8.4.6  Packet redirection Functions
   8.4.7  Packet Mirroring Functions
   8.4.8  Packet Sampling Functions
   8.5    Security Considerations
   8.5.1  Denial of Service (DoS) attacks
   8.5.2  Authentication and Encryption
          References
          Authors' Addresses
   A.     Sample Service Hierarchy
   B.     Sample Protocol for the foo IP Service
   B.1    Interacting with Other IP Services
   C.     Examples
          Intellectual Property and Copyright Statements

1. Introduction

   The concept of IP control and forwarding separation was first
   introduced in the early 1980s by the BSD 4.4 routing sockets
   [Stevens]. The focus at that time was to provide a simple IP(v4)
   forwarding service and allow the control plane, either via a command
   line configuration tool or a dynamic route daemon, to control
   forwarding tables for that IPv4 forwarding service. The IP world has
   evolved considerably since then.

   Linux Netlink [RFC3549], when observed from a service provisioning
   and management point of view, takes routing sockets one step further
   by breaking the narrow focus on IPv4 forwarding. Since the Linux 2.1
   kernel, Netlink has been providing the IP service abstraction for a
   few additional services other than classical RFC 1812 IPv4
   forwarding.

   Netlink was designed with the goal of solving the forwarding and
   control separation problem. This means that many of the main issues
   have been thought through and resolved over the years. In other
   words, Netlink is proven as a protocol addressing the separation of
   forwarding and control. Netlink is also network-ready because it uses
   packet formatting techniques and concepts (e.g., multicast
   addressing).
   This, and the availability of publicly running and tested code which
   is widely deployed, form a major motivation for basing Netlink2 on
   Netlink.

   Netlink2 extends Linux Netlink to meet the requirements of the ForCES
   working group charter for a protocol. Netlink is extended with a
   distributed addressing and transport scheme, and missing mechanisms
   are added so that Netlink2 meets the ForCES protocol requirements
   [ForCES_REQ].

   Netlink2 operates in a mode where knowledge of the NE, its topology,
   and LFB modeling MAY have already been discovered, or is discovered
   within the Netlink2 protocol.

   Netlink2 can operate over a variety of link, network, and transport
   media. These include, but are not limited to:

   o  L2, such as Ethernet, ATM, FR, etc.

   o  Bus and I/O interfaces, such as PCI, HT, PCI-express, etc.

   o  L3, such as IPv4, IPv6, IPX, etc.

   o  L4 and above, such as TCP, UDP, SCTP, and DCCP.

   In cases where required mechanisms are missing from the underlying
   media, they are compensated for by Netlink2 extensions (refer to
   Section 8.1).

2. Definitions

   We use the definitions provided in [ForCES_REQ], as well as the
   following:

   Logical Functional Block (LFB): same as Forwarding Engine Component
      as defined in [RFC3549]. This is a forwarding datapath component
      in the FE driven by the ForCES protocol in order to achieve a
      certain service.

   Control Element Component (CPC): same as Control Plane Component as
      defined in [RFC3549]. This is a component in the CE that drives
      LFB(s) in order to achieve a certain service.

3. Netlink2 Overview

   A datapath packet processing service accomplished by an FE is
   represented as a logical functional block (LFB) in the FE.
   CE components (CPCs) in the CE interact with LFBs over Netlink2 wires
   and bundles (described in Section 6.2) to configure and manage a
   certain service. The interactions between LFBs and CPCs are specific
   to each service and are defined using templates as presented in
   [RFC3549].

   The Netlink2 message is used to communicate between the FE and CPC
   for configuration of LFBs, for LFB events sent to the CPCs, and for
   statistics or configuration querying/gathering (typically by a CPC).
   Other activities include the transfer of control packets between FE
   and CPC. Netlink2 messages travel between the CPC and LFB over
   Netlink2 wires, which are part of Netlink2 bundles.

   Netlink2 wires are abstractions similar to GSMP links [RFC3292],
   albeit without the limitation to ATM VP:VC, Ethernet link, or TCP
   connection only.

   For instance, the IPv4 Forwarding service (called NETLINK_ROUTE)
   defines a message template for handling IP routes and the message
   types to insert, remove, or query a route. The routing CPC(s) and the
   IPv4 Forwarding LFB(s) interact using these message templates and
   message types over the Netlink2 bundle to execute the IPv4 Forwarding
   service. The message types in Netlink2 messages allow the FE to
   demultiplex messages to the appropriate LFB. Messages of a certain
   service destined to an LFB can travel on different Netlink2 wires
   within the same bundle.

   Netlink2 by itself constitutes a base ForCES protocol with a set of
   mechanisms that can be utilized depending on service requirements.
   For example, for certain messages between the FE and CE, reliability
   can be enforced at the transaction level by setting the appropriate
   flags in the Netlink2 message. However, by default, Netlink2
   transactions are not acknowledged.

4. Summary of Netlink2 Modifications to Netlink

   To conform to the ForCES requirements [ForCES_REQ], the Netlink
   protocol [RFC3549] is extended in the following respects:

   1.
       Base header modifications, and feature expandability extensions
       by means of optional header TLVs, to accommodate current generic
       ForCES requirements and to make it possible to add more in the
       future. This facilitates adding such features as authentication,
       checksumming, etc., when required.

   2.  IP and Transport encapsulations to carry Netlink messages.

   With these complementary changes to the existing Netlink
   functionality, Netlink2 fulfills the requirements to become the
   ForCES protocol.

4.1 Header Modifications

   1.  PID field redefinition and addition. In Netlink, PID 0 referred
       to the equivalent of the FE (kernel). The equivalent of the CE
       (user process) was referred to by its OS process id. In Netlink2,
       the PID has additional semantics which give it group identity,
       unicast capability, etc. (discussed later in Section 6.3). A PID
       of the unicastPID type is assigned to each FE and CE in the
       pre-association phase. In this way the CE uniquely identifies the
       FE and avoids any collision. We maintain the name PID for
       historical purposes.

       *  Destination PID: the PID field is redefined as the Destination
          PID field. This field identifies the parties on the wire that
          must process the message.

       *  Source PID: this field is introduced in the header to identify
          the source of the message.

       Different types of PIDs are discussed in Section 6.3.

   2.  The Length field has been reduced to 16 bits, with length 0 being
       reserved. The rest of the old 32-bit Length field is now split
       between a new Version field and a new Extended Flags field.

   3.  A Version field is introduced in the Netlink2 header. This 8-bit
       field consists of a 4-bit major number and a 4-bit minor number
       in the form major:minor. For Netlink2, the value is 0x20.

   4.  A new Extended Flags field is introduced. It takes the remaining
       8 of the 16 bits freed from the original 32-bit Length field in
       Netlink.
       Setting individual bits enables additional features, such as
       signaling the presence of extended TLVs.

   5.  Netlink2-extension TLVs follow directly after the Netlink2 base
       header. They are optional and their purpose is to extend the
       Netlink2 header. A typical use of Netlink2-specific TLVs is to
       compensate for capabilities lacking in an underlying transport.
       For example, in an IP network not deployed with IPSEC, the
       Netlink2-specific authentication TLV could be used to emulate the
       features provided by IPSEC-AH.

   6.  There can be more than one IP service configuration template
       within a Netlink2 message (as opposed to a single service
       template per Netlink message). Implementation experience
       (Section 6.3) has shown that embedding multiple service templates
       improves the performance of FE configuration.

   Other than these changes, all mechanisms provided by Netlink are
   sufficient to meet the requirements for ForCES. The reader is
   encouraged to refer to [RFC3549] as a companion to this document.

4.2 Addressing and Transport Extensions

   1.  Support for UDP/TCP/SCTP/DCCP transport over unicast/multicast IP
       (Section 6.1).

   2.  Support for bundles (Section 6.2).

   3.  Message recipient scoping using the Destination PID
       (Section 6.3).

   4.  Support for both local scope and global scope addressing
       (Section 6.4 and Section 6.5).

5. Netlink2 Message Format

   There are three levels to a Netlink2 message: the general Netlink2
   message header, which is mandatory, and the Netlink2-extension TLV
   and Service template(s), which are optional.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                    Netlink2 message header                    |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |               Netlink2-extension TLV (optional)               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                Service Template(s) (optional)                 |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Implementation studies [Goutaudier] have shown the above data layout
   to provide easier parsing while allowing for extensibility (via the
   optional Netlink2-extension TLV) and scalability (allowing for
   multiple Service templates).

   The Netlink2 message header is generic for all services and contains
   the command that describes the rest of the message.

   The optional Netlink2-extension TLV supplies general functionality
   missing from the Netlink2 message header. Typically, it is used to
   compensate for functionality missing from the underlying transport.

   The Service template is specific to a service. As mentioned earlier,
   there can be more than one template per Netlink2 message. Each
   Service template carries configuration parameters or query requests
   (CPC->LFB direction) or query responses (LFB->CPC direction). When
   multiple Service templates are present, all the templates MUST be
   used to execute the same command as defined in the Netlink2 message
   header. In some special cases the Service template is not used, for
   example in the case of a Netlink2 SYN, FIN, or NOOP command.

5.1 Netlink2 Message Header

   Each Netlink2 message contains a byte stream with a Netlink2 header
   followed by its associated payload.

   A single PDU may contain more than one Netlink2 message. This is
   referred to as batching.
   Netlink batching is reused in Netlink2 and allows messages with
   different commands (such as adding routes and deleting a QoS policy)
   to be carried in the same batch PDU.

   A Netlink2 message may be split across multiple PDUs if it does not
   fit into a single PDU. This is referred to as a multipart Netlink2
   message and is also inherited from Netlink. For multipart messages,
   the first and all following headers have the NLM_F_MULTI Netlink
   header flag set, except for the last header, which has the Netlink
   header type NLMSG_DONE.

   The Netlink2 message header is shown below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |    Flags_E    |            Length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Type              |             Flags             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Source PID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Destination PID                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The fields in the header are:

   Version: 8 bits

      The version field is split into major:minor (4:4 bits) subfields.
      The value for Netlink2 is 0x20.

   Flags_E: 8 bits

      These are extended flags:

      NLM_F_PRIO: Message priority: 1 for high and 0 for low. An
         additional QoS level can be set in the QoS TLV.

      NLM_F_ASTR: Sets the ACK strategy: 1 for partial ACKs and 0 for
         full ACKs.

      NLM_F_MS: Multiple Service templates are present when this flag is
         set to 1.

      NLM_F_EXT: If this flag is set, extended optional TLVs are
         present.

   Length: 16 bits

      The length of the Netlink2 message in bytes, including the header.

   Type: 16 bits

      This field describes the message content.
      It can be one of the standard message types:

      NLMSG_NOOP: The message is not executed on the LFB.

      NLMSG_ERROR: The message signals an error and the payload contains
         a nlmsgerr structure. This can be looked at as a NACK and
         typically it is from LFB to CPC.

      NLMSG_DONE: The message terminates a multipart message.

      NLMSG_SYN: Sent on the first message. Interpreted as a boot
         message of the sender.

      NLMSG_FIN: Sent on the last message. Interpreted as a shutdown
         message of the sender.

      Typically, services specify additional message types centered on
      transactional operations such as add, delete, and query. For
      example, the NETLINK_ROUTE Service specifies several types for
      manipulating IPv4 or IPv6 routes, such as RTM_NEWROUTE,
      RTM_DELROUTE, etc.

   Flags: 16 bits

      The standard flag bits used in Netlink are:

      NLM_F_REQUEST: Must be set on all request messages (typically from
         CE to FE).

      NLM_F_MULTI: Indicates the message is part of a multipart message
         terminated by NLMSG_DONE.

      NLM_F_ACK: Request for an acknowledgment on success. Typical
         direction of request is from CPC to LFB.

      NLM_F_ECHO: Echo this request. Typical direction of request is
         from CPC to LFB.

      Additional flag bits for GET requests on config information in the
      LFB:

      NLM_F_ROOT: Return the complete table instead of a single entry.

      NLM_F_MATCH: Return all entries matching the criteria passed in
         the message content.

      NLM_F_ATOMIC: This is an atomic operation or part of an atomic
         operation (such as two-phase commit).

      Convenience macros for flag bits:

      NLM_F_DUMP: This is NLM_F_ROOT ORed with NLM_F_MATCH.

      Additional flag bits for NEW requests:

      NLM_F_REPLACE: Replace existing matching config object with this
         request.

      NLM_F_EXCL: Do not replace the config object if it already exists.

      NLM_F_CREATE: Create config object if it does not already exist.

      NLM_F_APPEND: Add to the end of the object list.
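
      As a rough illustration of the header layout and flag bits
      described above, the following Python sketch packs and parses the
      20-byte Netlink2 header. It is not part of the protocol
      definition. It assumes network byte order, which this section does
      not specify. It also assumes that Netlink2 keeps the numeric flag
      and message type values used by Linux Netlink (linux/netlink.h and
      linux/rtnetlink.h). The names Header, pack_message, and
      unpack_message are hypothetical.

```python
import struct
from dataclasses import dataclass

# Version, Flags_E, Length, Type, Flags, Sequence, Source PID, Destination PID.
# Assumption: network byte order ("!"); the draft text does not state one here.
HEADER = struct.Struct("!BBHHHIII")

NL2_VERSION = 0x20  # major 2, minor 0

# Flag values as defined for Linux Netlink; assumed to carry over unchanged.
NLM_F_REQUEST = 0x001
NLM_F_ACK = 0x004
NLM_F_REPLACE = 0x100
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400

RTM_NEWROUTE = 24  # Linux rtnetlink value, used only as a sample message type


@dataclass
class Header:
    msg_type: int
    flags: int
    seq: int
    src_pid: int
    dst_pid: int
    flags_e: int = 0
    version: int = NL2_VERSION


def pack_message(hdr, payload=b""):
    """Serialize one Netlink2 message; Length covers header plus payload."""
    length = HEADER.size + len(payload)
    if length > 0xFFFF:
        raise ValueError("message does not fit the 16-bit Length field")
    return HEADER.pack(hdr.version, hdr.flags_e, length, hdr.msg_type,
                       hdr.flags, hdr.seq, hdr.src_pid, hdr.dst_pid) + payload


def unpack_message(data):
    """Parse the first message in data and return (Header, payload)."""
    version, flags_e, length, msg_type, flags, seq, src, dst = \
        HEADER.unpack_from(data)
    if length < HEADER.size or length > len(data):
        raise ValueError("invalid Length field")
    hdr = Header(msg_type, flags, seq, src, dst, flags_e, version)
    return hdr, data[HEADER.size:length]


# A "BSD ADD"-style request: create the object, fail if it already exists.
add_route = Header(msg_type=RTM_NEWROUTE,
                   flags=NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL,
                   seq=1, src_pid=0x00010001, dst_pid=0x00020001)
wire_bytes = pack_message(add_route, b"\x00" * 4)
```

      The four zero bytes stand in for a Service template, whose format
      is service specific and is not modeled here.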

      For readers familiar with the BSD use of such operations in route
      sockets, the equivalent translations are:

      *  The BSD ADD operation equates to NLM_F_CREATE ORed with
         NLM_F_EXCL.

      *  The BSD CHANGE operation equates to NLM_F_REPLACE.

      *  The BSD Check operation equates to NLM_F_EXCL.

      *  The BSD APPEND equivalent is actually mapped to NLM_F_CREATE.

   Sequence Number: 32 bits

      The sequence number of the message.

   Source PID: 32 bits

      The PID of the sender of the message (unicast or logical PID).

   Destination PID: 32 bits

      The PID of the destination of the message (unicast, logical, or
      broadcast PID).

5.2 Type Length Value

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           TLV Type            |      variable TLV Length      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Value (Data of size TLV length)                |
   ~                                                               ~
   ~                                                               ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   TLV Type:

      The TLV Type field is two octets, and indicates the type of data
      encapsulated within the TLV.

   TLV Length:

      The TLV Length field is two octets, and indicates the length of
      this TLV including the TLV Type, TLV Length, and the TLV data.

   TLV Value:

      The TLV Value field carries the data. For extensibility, the TLV
      Value may be a TLV. In fact, this is the case with the
      Netlink2-extension TLV. The Value encapsulated within a TLV
      depends on the attribute being configured and is opaque to
      Netlink2. It is therefore not restricted to any particular type
      (examples include ASCII strings such as XML, OIDs, etc.).

   TLVs must be 32-bit aligned.

5.3 Encapsulated TLVs

   TLV values can be other TLVs. This gives the flexibility of being
   able to add new attributes when needed.
   This is important for a protocol such as ForCES, for which attributes
   are expected to vary over a wide range of configurable blocks (CEs,
   FEs, LFBs, etc.). Note that Encapsulated TLVs can be viewed as
   abstractions that represent dynamic lists of attributes.

5.4 Netlink2-extension TLVs

   The Netlink2-Extension and Service TLVs are Encapsulated TLVs. They
   contain their respective TLVs as appropriate in the message being
   sent.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Outer TLV Type         |       Outer TLV Length        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Inner TLV1 Type        |       Inner TLV1 Length       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~                                                               ~
   ~         ~~~~~~~~~~~~~~ VALUE1 ~~~~~~~~~~~~~~~~~~~~~~          ~
   ~                                                               ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~                                                               ~
   ~                                                               ~
   ~                                                               ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Inner TLVn Type        |       Inner TLVn Length       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~                                                               ~
   ~         ~~~~~~~~~~~~~~ VALUEn ~~~~~~~~~~~~~~~~~~~~~~          ~
   ~                                                               ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer TLV Type:

      This is set to NL2_OPTIONS(0) to indicate the TLV is the
      Netlink2-Extension TLV. The rest of the possible type values are
      reserved for future use.

   Outer TLV Length:

      The Outer TLV Length is the length of everything within the TLV,
      including the Outer TLV Type field, the Outer TLV Length field,
      and all the encapsulated TLVs, which are treated as the Outer TLV
      Value.

   Outer TLV Value:

      The Outer TLV Value is all the inner TLVs. The figure above shows
      an outer TLV with n inner TLVs.

   Inner TLV Type, Length, Value:

      These are all just normal TLVs. No assumption is made about their
      data contents.
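
   The TLV and Encapsulated TLV layouts of Sections 5.2 through 5.4 can
   be sketched in Python as follows. This is a minimal illustration
   under stated assumptions, not a normative encoding: network byte
   order is assumed, the Length field is assumed to exclude the padding
   that restores 32-bit alignment (as with Linux Netlink attributes),
   and the inner TLV type codes 1 and 2 are hypothetical placeholders.
   Only NL2_OPTIONS(0) is taken from the text above.

```python
import struct

TLV_HDR = struct.Struct("!HH")  # TLV Type, TLV Length (byte order assumed)
NL2_OPTIONS = 0                 # Outer TLV Type of the Netlink2-extension TLV


def pad4(n):
    """Round n up to the next multiple of 4 (32-bit alignment)."""
    return (n + 3) & ~3


def encode_tlv(tlv_type, value):
    """Encode one TLV; Length covers Type, Length, and Value, not padding."""
    length = TLV_HDR.size + len(value)
    if length > 0xFFFF:
        raise ValueError("TLV too long for a 16-bit Length field")
    padding = b"\x00" * (pad4(length) - length)
    return TLV_HDR.pack(tlv_type, length) + value + padding


def decode_tlvs(data):
    """Decode a sequence of aligned TLVs into a list of (type, value)."""
    tlvs, offset = [], 0
    while offset < len(data):
        if len(data) - offset < TLV_HDR.size:
            raise ValueError("truncated TLV header")
        tlv_type, length = TLV_HDR.unpack_from(data, offset)
        if length < TLV_HDR.size or offset + length > len(data):
            raise ValueError("invalid TLV Length")
        tlvs.append((tlv_type, data[offset + TLV_HDR.size:offset + length]))
        offset += pad4(length)
    return tlvs


# Two inner TLVs with hypothetical type codes, wrapped in the outer TLV.
inner = encode_tlv(1, b"abc") + encode_tlv(2, b"\x01\x02\x03\x04")
outer = encode_tlv(NL2_OPTIONS, inner)
```

   Decoding proceeds in two passes: decode_tlvs(outer) yields the single
   outer TLV, and decode_tlvs applied to its value yields the inner
   TLVs. A receiver that does not understand an inner type can skip it
   using the Length field alone.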

6. Addressing and Transport Extensions

   We extend Netlink to make it distributed. The focus is on giving
   Netlink2 a strong local scope view of the world while fitting well
   into a global scope when the hop distance between the FE and CE
   increases.

   If the network interconnecting the FE(s) and CE(s) is completely
   hidden from the outside (black-box view), for instance an internal
   Ethernet segment or a switching fabric in which CE(s) and FE(s) are
   connected within physical proximity, then communications between FE
   and CE are assumed to be of a local scope.

   On the other hand, if communications between FE and CE cross several
   hops of the network, then the scope is considered global.

6.1 Transport Methods

   The ideal environment for Netlink2 is considered to be a
   multicast-capable medium with IP above it and with UDP/TCP/SCTP/DCCP
   running over IP. On the other hand, Netlink2 is also capable of
   running directly over L2 (Ethernet, for example).

   In a non-IP, non-multicast-capable environment, the ForCES layer
   needs extra processing and messaging to compensate for services that
   IP already offers (e.g., security, quality of service, and
   fragmentation, if the underlying transport does not provide them).

6.1.1 Why Multicast?

   Multicast is considered important to facilitate one-to-many/some
   communication. For example, a single command from a CE can be
   multicast to multiple FEs, which helps meet the scalability
   requirements mentioned in [ForCES_REQ]. This is discussed in later
   sections.

   When running Netlink2 over non-multicast-capable media, it is
   expected that mechanisms similar to those used in OSPF NBMA [RFC2328]
   networks will be put in place.

6.1.2 Why IP?

   IP runs on virtually every link layer. Leveraging this fact alone
   helps deploy the protocol more widely and more quickly.
   IP also provides numerous services such as fragmentation and
   reassembly, prioritization, and security, which are inherent
   requirements for the ForCES protocol. This means that successfully
   running over an alternative to IP requires the underlying layer to
   provide similar services in order to meet the requirements.

   Netlink2-specific optional TLVs can be used to compensate for lacking
   functionality when running on a network transport other than IP or
   directly on the link layer.

   Netlink already allows the definition of multipart messages with IP
   segmenting/reassembling when the path MTU is exceeded. When running
   on top of non-IP media, the Netlink2 message can be limited so that
   it does not exceed the MTU; the multipart message facility can then
   be used to provide framing for segmenting/reassembling.

   The Netlink2-specific Authentication TLV can be used to carry
   authentication signatures over a transport that does not have this
   capability.

   The Netlink2-specific Checksum TLV can be used to carry checksums
   over a medium that does not have this capability.

   The Netlink2-specific Message Priority TLV can be used to carry
   prioritization if transports are not capable of expressing priorities
   in their headers.

6.1.3 Why UDP/TCP/SCTP/DCCP?

   On a local scope, it is assumed that multicast UDP over IP is the
   preferred mode of operation. On a global scope, it is expected that
   TCP or SCTP would be used for enhanced reliability and Internet
   congestion friendliness.

   All of the mentioned protocols provide 16-bit ports, which are
   further address-demultiplexing points. They also all provide a
   checksum capability to enhance the integrity of the Netlink2 message.
   In the case of UDP, the checksum is optional (which fits the model
   that the local scope is less error-prone than the global scope, so
   the integrity check can be turned on only when needed).
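
   The use of the multipart facility for framing over non-IP media, as
   described in Section 6.1.2, can be sketched as follows. This is an
   illustration under assumptions rather than a specified procedure: the
   header is packed in network byte order, the NLM_F_MULTI and
   NLMSG_DONE values are those of Linux Netlink, all parts reuse one
   sequence number, and records are split only on whole-record
   boundaries so that no TLV is cut in half. The function names segment
   and reassemble are hypothetical.

```python
import struct

# Version, Flags_E, Length, Type, Flags, Sequence, Source PID, Destination PID
HEADER = struct.Struct("!BBHHHIII")  # byte order is an assumption
NL2_VERSION = 0x20
NLM_F_MULTI = 0x002  # Linux Netlink value, assumed unchanged
NLMSG_DONE = 0x3     # Linux Netlink value, assumed unchanged


def segment(records, mtu, msg_type, seq, src_pid, dst_pid):
    """Pack whole records into multipart messages that each fit the MTU."""
    if mtu > 0xFFFF:
        raise ValueError("MTU exceeds the 16-bit Length field")
    room = mtu - HEADER.size
    groups, current = [], b""
    for record in records:
        if len(record) > room:
            raise ValueError("record does not fit in a single PDU")
        if current and len(current) + len(record) > room:
            groups.append(current)
            current = b""
        current += record
    if current:
        groups.append(current)
    parts = [
        HEADER.pack(NL2_VERSION, 0, HEADER.size + len(group), msg_type,
                    NLM_F_MULTI, seq, src_pid, dst_pid) + group
        for group in groups
    ]
    # The final header carries type NLMSG_DONE and no NLM_F_MULTI flag.
    parts.append(HEADER.pack(NL2_VERSION, 0, HEADER.size, NLMSG_DONE, 0,
                             seq, src_pid, dst_pid))
    return parts


def reassemble(parts):
    """Concatenate multipart payloads until NLMSG_DONE is seen."""
    payload = b""
    for part in parts:
        _, _, length, msg_type, flags, *_ = HEADER.unpack_from(part)
        if msg_type == NLMSG_DONE:
            return payload
        if not flags & NLM_F_MULTI:
            raise ValueError("part is missing NLM_F_MULTI")
        payload += part[HEADER.size:length]
    raise ValueError("multipart message not terminated by NLMSG_DONE")


records = [b"A" * 40, b"B" * 40, b"C" * 40]
parts = segment(records, mtu=100, msg_type=24, seq=7, src_pid=1, dst_pid=2)
```

   With a 100-byte MTU and a 20-byte header, the first two 40-byte
   records share one part, the third travels alone, and a header-only
   NLMSG_DONE message closes the sequence.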

6.2 The Netlink2 wire and bundle

   A Netlink2 wire displays the same behavior as a Netlink wire. It
   interconnects FEs and CEs in order to support services they jointly
   offer.

   The only conceptual difference between a Netlink2 wire and a Netlink
   wire is that whereas the Netlink wire is localized, the Netlink2 wire
   is distributed.

   We also introduce the concept of a Netlink2 bundle. A Netlink2 bundle
   interconnects a set of FE(s) and/or CE(s) by means of one or more
   Netlink2 wires. Note that a Netlink2 bundle does not necessarily mean
   a full-mesh interconnection (see examples later on). Parties (FEs and
   CEs) on a Netlink2 bundle share common configuration, provisioning,
   and event-notification end goals.

   A Netlink2 wire MAY be constructed using a multicast connection, a
   unicast connection, or multiple multicast and unicast connections. A
   wire MUST belong to only one bundle. A bundle may have only a single
   wire (unicast or multicast). In most cases we believe there will only
   be one multicast address for a bundle, although scalability issues
   could require the use of unicast connections in addition.

   When a multicast IP address is used, a Netlink2 wire MUST run over
   UDP; a UDP port is used to uniquely identify the wire. There MAY be
   multiple wires using the same multicast address as long as they run
   over different UDP ports.

   When a unicast IP address is used, the description of how to connect
   to an endpoint (CE/FE) is subject to the agreement between the CE and
   FE. The connection could be directly over IP (Note: need an IP
   protocol number) or via transport-layer ports (TCP/UDP/SCTP/DCCP).

   In both unicast and multicast wires, the necessary parameters (such
   as IP address and port numbers) can be discovered with the
   involvement of the FE and CE Managers.

6.2.1 What wires go in a bundle?

   Netlink2 provides the flexibility to have a bundle of purely unicast
   wires, purely multicast wires, or a hybrid of both. The decision of
   what goes into a bundle can be made in the pre-association phase.

   A good analogy is to think of a multicast wire as a broadcast link
   (as is done in Netlink) in which CE(s) and FE(s) are parties attached
   to that broadcast link. Depending on the number of FEs and CEs on an
   NE, a single multicast wire in the bundle may be sufficient.

   Multicast allows one-to-some messaging. A single message sent by an
   originator is seen by all parties on the wire. This simplifies
   synchronization in an HA environment as well as implementation of the
   protocol.

   The fact that multicast messages are seen by all parties could cause
   scalability issues as the number of nodes grows. Parties need to
   filter out messages not destined to them. Filtering consumes compute
   resources, or table resources if it is done in hardware. The extra
   messages also consume bandwidth unnecessarily at FE(s) and CE(s) not
   interested in seeing these messages.

   Unicast wires could be used to create point-to-point connections
   between the parties; when every party is connected to every other
   party, this becomes a full mesh.

   A full unicast mesh topology removes the need to filter unnecessary
   messages but introduces scalability concerns, because the number of
   connections required grows quadratically with the number of parties
   (FEs and CEs) present. This requires a lot more compute and state
   information to be maintained at each party.

   A pure mesh topology also complicates HA because more state must be
   maintained (for instance, the IP addresses of the CEs and FEs that
   are active and what their backups are), and extra processing is
   therefore needed to achieve failover. Failover becomes transparent if
   multicast is used among all parties.
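
   To make the quadratic growth mentioned above concrete, the short
   Python sketch below counts the point-to-point connections of a full
   unicast mesh. The function name and the example party counts are
   illustrative assumptions, not figures from this document.

```python
def full_mesh_connections(parties: int) -> int:
    """Number of point-to-point connections in a full mesh of n parties."""
    return parties * (parties - 1) // 2


def connections_per_party(parties: int) -> int:
    """Connections (and associated state) each party must maintain."""
    return max(parties - 1, 0)


# Hypothetical NE: 2 CEs (active and backup) and 16 FEs.
example_total = full_mesh_connections(2 + 16)
example_each = connections_per_party(2 + 16)
```

   For the hypothetical 18-party NE this gives 153 connections in total
   and 17 per party, whereas a bundle with a single multicast wire needs
   one group membership per party regardless of the number of parties.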

Netlink2 allows a bundle to have a hybrid of unicast and multicast connections. Note this is a model used by other protocols such as OSPF over broadcast links where the Hello protocol is multicast but responses to LSA updates are unicast.

We present some examples of Netlink2 bundles:

1. A trivial case is a Netlink2 bundle consisting of a single unicast wire between the CE and FE it interconnects.

2. Multiple FEs and a CE could be interconnected with a Netlink2 bundle using a single multicast connection.

3. In the same example as 2) above, the unicast address of the CE could in addition also be used, for instance, to deliver acknowledgments or notifications from the FEs to the CE, and not be seen by all other FEs. The unicast addresses of the FEs could also be used, for instance, to deliver certain messages only to a specific FE, such as a retransmission of a message in a two-phase commit only to an FE that did not respond.

4. Multiple FEs and CEs could use a wire with two multicast connections: one for all FEs, the other for all CEs, so that messages only relevant to FEs are not seen by CEs and vice-versa.

6.3 Redefining the Netlink PID Semantics

We maintain the name PID for historical purposes and introduce a Destination PID and a Source PID as mentioned earlier. For every message received by each party on the wire, the destination PID field indicates the recipient of the message. The addressed party could be either a FE or a CE, respectively a LFB or a CPC.

In addition to Netlink2 wires (unicast or multicast) defining the destination of a particular message delivered, the PID types provide further control, namely to define which entity actually has to process the message. So if the bundle uses only a single multicast wire, messages will be heard by all parties on the wire, but only those with a matching PID will actually process these messages.

We introduce special-purpose PIDs addressed to specific listeners on the wire. The following types of PIDs are defined and can be used in the Netlink2 messages. The actual values for the PID of a FE or CE must be the same across all wires of the same bundle and must be established during the pre-association phase. Default values are given. PIDs must be unique within a Netlink2 wire. They may also be unique within the NE. PIDs are subdivided into two 16-bit subfields named wire and party in the form wire:party.

1. unicastPID: allows one to uniquely address a FE or CE. Each FE/CE must have such a unicast PID. Only the FE or CE assigned to this PID must process an incoming message with such a Destination PID. Other parties MAY silently discard the message. The wire subfield is a unique identifier of the FE or CE. The party subfield acts as a port number: it can for instance be used to further demultiplex a message to the appropriate process in a CE (CPC) or the appropriate LFB in an FE. Default value: none.

2. logicalPID: in addition to unicastPID, a FE/CE MAY have zero or more logical PIDs assigned to it. A logicalPID can be used for active-backup pairs of FEs: for instance, the active and the backup FE have the same logical PID or at least the same wire subfield. The wire subfield is an identifier of the group of FEs and/or CEs participating in the group. Pre-association configuration ensures that the same party identifier is not assigned twice to different CPCs or LFBs on the same wire. Default value: none.

3. broadcastPID: all parties on all wires must process an incoming message with such a Destination PID. An example of a message that might be broadcast is when a CE is brought down for maintenance. Default value: 0xffffffff

4. FEbroadcastPID: all FEs on all wires must process an incoming message with such a Destination PID.
   Typically a route update from the CE to all FEs. Other parties (CEs) can silently discard the message. Default value: 0xffffefff

5. CEbroadcastPID: all CEs on all wires must process an incoming message with such a Destination PID. Other parties (FEs) can silently discard the message. Default value: 0xffffdfff

A Netlink2 message must have as Destination PID one of the PID types defined above. The Source PID of a Netlink2 message must be of the unicastPID or logicalPID type. In addition, if the NLM_F_ACK flag is set, then every party processing the message MUST reply with an acknowledgment after processing the message, unless the NLM_F_ASTR flag is used to prevent ACK implosion.

Pre-configured translation tables can be used to map a given PID into the underlying wire in a bundle, i.e., an IP unicast or multicast address.

6.4 Local Scope Addressing and Encapsulation

At a local scope, the preferred addressing used for a wire is a UDP port on top of a multicast IP address. Multiple wires can run on one multicast address with a further demultiplexing level based on the UDP port. The wire addressing parameters MAY be discovered during the pre-association phase.

6.5 Global Scope Addressing and Encapsulation

When addressing a non-local scope the Netlink2 message is encapsulated over a transport header and shuttled to the remote end where it is decapsulated and run as if originating from the local scope of that remote end. The global scope addressing could use any transport protocol configured (SCTP, UDP, TCP or DCCP) as agreed upon in the pre-association phase. This can be viewed as an extension of the local scope wires.

7. Protocol Architecture

7.1 Protocol Phases

ForCES in relation to NEs involves three phases: the pre-association phase, the association phase where the ForCES protocol operates, and a termination phase where a party in the relationship leaves a bundle.

7.1.1 The Pre-Association Phase

In a simple setup, this phase is static. All the parameters for the association phase are well known (for example, multicast groups for each Netlink2 wire in a bundle, etc.). Vendors may use their own proprietary service discovery protocol. As a minimum, we assume a static configuration.

In fact, although ForCES mandates a minimal set of capability discovery, Netlink2 will also operate in a mode where such capability discovery is done in the pre-association phase. In that case, the FE Manager and the CE Manager agree on all the parameters and clearly articulate topology and other information to each other in the pre-association phase.

On completion of the Service Discovery phase, the FEM will have established contact with the appropriate CEM component. Initialization and Authentication will be complete at this point. Both the FE and CE know how to connect to each other for configuration, accounting, identification and authentication purposes. Both sides are also knowledgeable of all necessary protocol parameters such as timers, etc. All capabilities may also have been discovered at this point.

7.1.2 The Association Phase

In this phase, the FE and CE components cooperate to deliver the IP service. The CE component might be registered (in the pre-association phase) to receive FE-specific services (such as link events). Essentially, in this phase, the service is provisioned and executing. The FE component might continuously get updates from the control plane component on how to operate the service (for example, the IPv4 forwarding route additions or deletions). The association phase is where Netlink2 operates as the ForCES protocol.

On startup, the FE connects to the bundle(s) to which the CE is connected, using the procedure defined in Section 8.3.1. The controlling CE will either admit the FE into the NE or reject it.

Once granted access into the NE, the FE is continuously updated or queried. The FE may also send async event notifications to the CE. This continues until a termination is initiated by either the CE or FE.

7.1.3 Service Termination

Service termination could be issued by either component of the service abstraction. The FE or the CE initiating the termination will issue a FIN command.

7.2 Protocol Logical Model

In the diagram below we show a simple LFB-CPC logical relationship. We use the IPv4 Forwarding LFB as an example.

[Figure: LFB-CPC logical model. A CE hosts CPC-1 (ospfd) and CPC-2 (COPS PEP). Both attach to the Netlink2 bundle, which connects them to an FE whose IPv4 Forwarding LFBs are ingress police, IPv4 Forward and Egress QoS Sched.]

Netlink2 logically models LFBs and CPCs in the form of service blocks interconnected to each other via a Netlink2 bundle.

Acknowledgements and responses to messages do not have to be sent onto the same wire from which the triggering messages came but MUST be sent on the same bundle to the same originating PID.
For instance, a wire interconnecting a CE with multiple FEs using a multicast address could be used to send route updates from the CE. On the other hand, independent unicast wires from each FE to the CE could be used to send back route events or acknowledgments. Note that sequencing is done per wire and Source PID, and ACKs can travel back on any wire of a bundle.

The Netlink2 wire can be shared or be specific to a service. There can be multiple Netlink2 wires bundled in a bundle carrying messages of the same service. Additional Netlink2 wires can be used to reduce the messaging seen by a party (for example, to avoid extra processing) or to restrict it for partitioning or security reasons. A possible partitioning is a Netlink2 bundle per service. In the example above the IPv4 Forwarding LFB would be considered a service.

Assuming capabilities have been discovered during the pre-association phase (between the FEM and CEM), blocks (CPCs or LFBs as illustrated above) connect to the agreed wires on the Netlink2 bundle, and listen to receive specific messages. CPCs may connect to multiple Netlink2 wires if it helps them to control the service better.

All blocks (CPCs and LFBs) dump packets on the Netlink2 wires. LFBs or CPCs join Netlink2 wires and listen to messages of interest for processing or monitoring purposes.

All messages addressed to the LFB (for example the IPv4 forwarding LFB illustrated above) will have the FE PID agreed upon by both the CE and the FE at the pre-association phase. LFBs (as well as CPCs) also process messages with the broadcast PIDs. They may also process messages destined to other LFBs (as well as CPCs) for availability synchronization purposes.

A further demultiplexing point is the command type in the Netlink2 message. Each of the LFBs (e.g., the ingress police LFB above) knows how to respond to a specific command-set as defined by the Netlink2 message type.
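
The PID-based filtering described in this section and in Section 6.3 can be sketched as follows. This is a minimal illustration in Python, not a normative algorithm. The default broadcast values are those given in Section 6.3; the function names, the "FE"/"CE" role strings, and the choice of placing the wire subfield in the upper 16 bits of the PID are assumptions made for the example. Passive processing of messages destined to other parties (for availability synchronization) is not modeled.

```python
BROADCAST_PID = 0xFFFFFFFF     # default value from Section 6.3
FE_BROADCAST_PID = 0xFFFFEFFF  # default value from Section 6.3
CE_BROADCAST_PID = 0xFFFFDFFF  # default value from Section 6.3


def make_pid(wire: int, party: int) -> int:
    """Build a 32-bit PID from its wire:party subfields.

    Assumption: the wire subfield occupies the upper 16 bits.
    """
    return ((wire & 0xFFFF) << 16) | (party & 0xFFFF)


def split_pid(pid: int) -> tuple:
    """Return the (wire, party) subfields of a PID (same assumption)."""
    return (pid >> 16) & 0xFFFF, pid & 0xFFFF


def must_process(dest_pid: int, role: str, own_pids: set) -> bool:
    """Decide whether a party has to process a received message.

    role is "FE" or "CE"; own_pids holds the unicast and logical PIDs
    assigned to this party during pre-association.
    """
    if dest_pid == BROADCAST_PID:
        return True
    if dest_pid == FE_BROADCAST_PID:
        return role == "FE"
    if dest_pid == CE_BROADCAST_PID:
        return role == "CE"
    return dest_pid in own_pids


if __name__ == "__main__":
    fe_pid = make_pid(wire=1, party=7)  # hypothetical FE, party 7 = one LFB
    print(must_process(FE_BROADCAST_PID, "FE", {fe_pid}))   # FE handles it
    print(must_process(CE_BROADCAST_PID, "FE", {fe_pid}))   # FE discards it
    print(split_pid(fe_pid))
```

After a message passes this check, the party subfield and the Netlink2 message type would select the LFB or CPC and the command to run.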

7.3 Service Addressing

Connecting to a service is achieved by connecting to a defined Netlink2 bundle by both the CPC and LFB. This Netlink2 bundle is derived in the pre-association phase. A service would typically be constrained to a specific Netlink2 bundle.

Connecting to a service is followed (at any point during the lifetime of the connection) by either issuing a service-specific command mostly for configuration purposes (from the CPC to the LFB) or for statistics collection. The LFB could also send event announcements to the CPC or respond to queries issued by the CPC.

7.4 Service Templates

LFBs throw events and are configured and queried by using service templates. Refer to the Netlink document [RFC3549] as well as Section 8.4 for the different templates used for different LFBs that fit within the current scope of the ForCES charter.

7.5 Mechanisms for Creating Protocols

Mechanisms for creating reliable or non-reliable protocols are provided. In addition, mechanisms for facilitating availability are embedded in Netlink2.

7.5.1 Building Reliable Protocols

By default the Netlink2 header flags NLM_F_PRIO and NLM_F_ACK are not set so that Netlink2 messages are sent with a lower priority and do not require acknowledgements.

One could create a reliable protocol between an LFB and a CPC by using the combination of sequence numbers, ACKs and retransmit timers. Both sequence numbers and ACKs are provided by Netlink2. Timers are provided by the operating system or hardware.

Prioritization is an orthogonal mechanism to reliability. When a node runs out of resources, a message sent with a higher priority will get preferential treatment.
For instance, if a FE has only enough memory to allocate one message in response to a message from the CE and it has to choose between one of two messages to respond to, then it will use that memory for the request which was sent with the higher priority. This also applies to other resources such as computing cycles and bandwidth. In other words, the NLM_F_PRIO is more than only the classical bandwidth prioritization of packets on a link.

Another orthogonal mechanism provided by Netlink2 is the ACK strategy which is selected by the NLM_F_ASTR flag. We define two types of acknowledgement strategies:

1. partial ACKs (using multicast ACK slotting and damping techniques [XTP]): receivers multicast an ACK after a random time if they have not yet seen an ACK sent by another receiver. This limits the number of ACKs returned to the source of the message and improves performance. For messages which a CE sends to a group of FEs, partial ACKs imply that any one of the FEs generating an ACK back is sufficient to deem the message delivered.

2. full ACKs: each receiver sends an ACK back to the source. This allows the source to immediately detect problems with receivers. In two-phase commits it is important that all FEs respond, so the full ACKs strategy should be used.

7.5.2 Building Availability

A protocol component or an application could passively listen to Netlink2 commands and events within one or several Netlink2 wires. Doing so allows a very simple way of building complex applications which are aware of all service components that affect them for HA reasons.
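
Returning briefly to the partial ACK strategy of Section 7.5.1, the damping behavior can be illustrated with a small deterministic model. This is only a sketch under simplifying assumptions and is not the slotting and damping algorithm of [XTP]: each receiver is given a pre-drawn random delay, every multicast ACK reaches all other receivers after the same hypothetical propagation time, and a receiver suppresses its own ACK only if another ACK has arrived strictly before its timer fires. The function name partial_ack_senders is illustrative.

```python
def partial_ack_senders(ack_delays: dict, propagation: float = 0.0) -> list:
    """Return the receivers that actually multicast an ACK.

    ack_delays maps a receiver id to the random delay it drew before
    sending its ACK. A receiver stays silent if an ACK from another
    receiver reached it before its own delay expired.
    """
    sent = []  # (send_time, receiver_id) for ACKs really transmitted
    for rid, delay in sorted(ack_delays.items(), key=lambda kv: kv[1]):
        heard_ack = any(t + propagation < delay for t, _ in sent)
        if not heard_ack:
            sent.append((delay, rid))
    return [rid for _, rid in sent]


if __name__ == "__main__":
    delays = {"fe1": 0.03, "fe2": 0.01, "fe3": 0.02}
    # Instant propagation: only the earliest receiver ACKs.
    print(partial_ack_senders(delays))
    # Slower propagation: fe3 fires before it hears fe2's ACK.
    print(partial_ack_senders(delays, propagation=0.015))
```

With the full ACK strategy every receiver in ack_delays would respond, which is why that strategy suits two-phase commits and why partial ACKs help against ACK implosion.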

To ensure transparent CE or FE redundancy for certain services, it is sufficient to ensure that the backup CPC/LFB is always attached to the same wires to which the active CPC/LFB is attached, so that the backup CPC/LFB receives all messages destined to the active CPC/LFB (whatever PID they are sent to) as well as all messages originating from the active CPC/LFB.

One could create a heartbeat protocol between the LFB and CPC by using the ECHO flags and the NLMSG_NOOP message (Section 8.3.5). The heartbeat, in addition to listening to FE or CE events, could be used to facilitate takeover. This topic is beyond the scope of ForCES and will not be discussed further here. Note, however, that Netlink2 has the mechanisms required to enable this when required.

7.5.3 The ACK Netlink2 Message

This message is actually used to denote both an ACK and a NACK. Typically the direction is from LFB to CPC (in response to an ACK request message). However, a CPC should be able to send ACKs back to an LFB when requested. The semantics for this are IP service specific.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Netlink2 message header                    |
   |                      type = NLMSG_ERROR                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          error code                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  OLD Netlink2 message header                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Error code: integer (typically 32 bits)

An error code of zero indicates that the message is an ACK response. An ACK response message contains the original Netlink2 message header that can be used to compare against (sent sequence numbers, etc). A non-zero error code message is equivalent to a Negative ACK (NACK).
In such a situation, the Netlink2 data that was sent down to the kernel is returned appended to the original Netlink2 message header.

7.5.4 Batching

As mentioned earlier (repeated here for clarity), standard Netlink multi-message batching looks as follows:

   NLMSG:NLMSG:NLMSG....

where NLMSG is a Netlink2 header and its associated payload. This has the advantage of allowing inter-mixing of multiple commands (for example, adds/deletes), generally in a request from CE->FE. It is also useful for batching multiple events from the FE->CE.

Additionally, studies from [Goutaudier] have motivated batching of Service Templates within a single Netlink2 message. Recall, a Netlink2 message looks like:

   NLMSGHDR:OET:ST

where NLMSGHDR is a Netlink2 header, OET is the optional extension TLVs and ST is the service template. The template extension now looks like:

   NLMSGHDR:OET:ST:ST:ST.....

In other words, there are multiple service templates that can fit within the same message. There are caveats with such a batching scheme: since only one ACK may be sent for a whole batch, it is difficult to know which service configuration failed. On a close-proximity, low-error-rate link, batching in this mode should allow for high configuration throughput while reducing the number of ACKs sent back.

7.5.5 Atomicity and Ordering of Transactions

In a two-phase commit, messages are bound into a relationship. The first and all following headers have the NLM_F_MULTI Netlink2 header flag set, except for the last header, which has the Netlink2 header type NLMSG_DONE. Typically, in Netlink, the NLMSG_DONE shows up in separate PDUs to define a commit.

Atomicity of a transaction including that of a batch is achieved by using the NLM_F_ATOMIC flag. Use of the NLM_F_ATOMIC is expensive because it may necessitate the locking of access to tables (depending on the implementation).

8. Putting together the base protocol for WG charter

The design approach taken for the Netlink2 protocol is to avoid over-featuring the protocol and focus on the requirements under the current WG charter.

Although Netlink2 could be used for CE-CE or FE-FE communication, this is not discussed in this document to avoid complexity.

Additionally, although Netlink2 provides the minimal required attribute discovery, it will work with existing proprietary or open protocols which exist to discover such attributes.

8.1 Netlink2-Extension TLVs

Netlink2-Extension TLVs are mostly used to compensate for the underlying transport not having mechanisms needed by Netlink2.

8.1.1 Authentication

[TBD]

8.1.2 Checksum

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      TLV Type = NL2_CSUM      |        TLV Length = 2         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Checksum (16 bits)       |  Alignment Padding (16 bits)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This TLV is optional. To compute the correct checksum, an implementation MUST add the optional checksum TLV to the Netlink2 message with the initial checksum value of 0 and compute the checksum over such a Netlink2 message. Refer to [RFC3358] for details on the Checksum TLV.

8.1.3 Message Priority

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     TLV Type = NL2_MPRIO      |        TLV Length = 2         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Priority (16 bits)       |  Alignment Padding (16 bits)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This TLV is optional.
It is used if the network does not support prioritization. This field is used to indicate priorities to the remote end.

8.1.4 SYN COOKIE

TBF TLV_TYPE = NL2_COOKIE.

8.1.5 Name ID

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Type = NL2_NAMEID       |     TLV Length = variable     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         size of name                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This TLV is optional. It is used to identify a name that a CE or FE wishes to be known as. It is typically exchanged with SYN messages.

8.2 LFB and FE Attributes and discovery

In the association phase the CE queries the FE to determine its capabilities. These may include the FE-FE topology, the initial LFB topology for the FE, constraints on how the LFB topology can be modified (if possible), etc. A schema for representing FE and LFB attributes and capabilities is being defined in [ForCES_Model]. Appropriate Netlink2 TLVs will be defined to convey the identified parameters as the model work progresses.

8.3 NE creation

The FE and CE Managers communicate to decide communication parameters and rules that are to be used in the transaction between the CE and FE.

Using the agreed-on parameters, the FE attempts to join the NE. The CE may reject the FE or allow it to join. The FE then communicates with the FEM to inform it of the decision. Note that we do not discuss the FE-FEM or CE-CEM interfaces in this document as they are beyond the scope of ForCES.

8.3.1 FE State transitions

[Figure: FE state transition diagram. States: INIT, SYN_SENT, EST, FIN_SENT. Transitions: INIT to SYN_SENT on "send SYN"; SYN_SENT to itself on "SYN retran."; SYN_SENT to EST on "recvd SYN|ACK"; SYN_SENT to INIT on "recvd NACK or max retransmit"; EST to FIN_SENT on "Send FIN"; FIN_SENT to itself on "FIN retrans"; FIN_SENT to INIT on "recvd FIN|ACK"; EST to INIT on "recvd FIN|ACK or recvd SYN broadcast".]

INIT state:

When the FE is started (by the FE manager or otherwise) it goes into the INIT state. At this point the FE has been informed by the FE Manager of the following (based on current implementation):

o the bundle to join,

o its PID,

o the PID of the CE,

o the number of retries for the SYN transmission and the SYN timer,

o and the number of retries for the FIN transmission and the FIN timer value.

The FE Manager would also instruct the FE to be either active or passive. Although this is beyond the ForCES charter, the active/passive setup description is introduced here to describe one way to achieve redundancy. Netlink2 does not mandate how redundancy is achieved. Netlink2 imposes that FE redundancy is the role of the FE plane; as such, Netlink2 is designed so that the CE has no knowledge of FE redundancy. This greatly simplifies the protocol.

After internal initialization, the FE sends a SYN message with the ACK flag on. The message will contain a Netlink2-extension TLV of type NL2_NAMEID. The NL2_NAMEID TLV will contain the name the FE wishes to be known as. The FE then enters the SYN_SENT state.

A FE could passively monitor the state of one or more FEs and synchronize their state and communication data with the CE. The end goal of a passive FE is to act as a backup for the FE whose activities it is monitoring.
The monitoring is trivial to achieve if multicast is used. The synchronization may also happen via a FE-FE protocol or via the FE Manager. A passive FE may be called on by the FE manager to take over the functionality of the FE it is monitoring.

SYN_SENT state:

The FE fires the SYN timer and waits for a response from the CE. Two events could happen:

1. The timer expires. If the number of retries has not reached the maximum allowed value, then the SYN is retransmitted and the timer restarted. If the maximum number of retries has been reached with the last SYN transmission then the FE notifies the FE manager and goes into INIT state.

2. A packet is received from the CE:

   * A NACK packet to the sent SYN packet. Action: cancel the timer, inform the FE manager of the rejection reasons and go into INIT state.

   * An ACK packet to the sent SYN packet. Action: update the FE manager and go into EST state.

EST state:

This is the established state where normal ForCES communication starts. Several events may force the FE to transition out of the EST state:

1. The FE manager requests it to. In this case the FE will issue a FIN with an ACK request to the CE and transition to the FIN_SENT state.

2. The CE asks it to leave. This is considered a reset of the FE. The FE receives a FIN from the CE to inform it to leave. The FE immediately informs the FE manager, sends a FIN and goes into INIT state.

3. The CE restarts and sends a broadcast SYN. This may be caused by either the CE manager restarting the CE to clear its state or a result of the CE dying and being restarted. Control of restarting of the CE and association to the CE manager is out of scope for ForCES. Upon receiving the broadcast SYN, the FE assumes the CE has no knowledge of any state the FE is in and transitions into the INIT state after informing the FE manager.
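
The FE-side transitions of this section (INIT, SYN_SENT, EST and FIN_SENT) can be summarized as a small state machine. The Python sketch below is illustrative only: the class and method names are hypothetical, timers are driven by the caller rather than by a real clock, notifications to the FE manager are omitted, and messages are recorded as plain strings instead of being encoded as Netlink2 headers.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    SYN_SENT = "SYN_SENT"
    EST = "EST"
    FIN_SENT = "FIN_SENT"


class FeStateMachine:
    """Hypothetical model of the FE side; the caller drives the timers."""

    def __init__(self, max_syn_retries: int, max_fin_retries: int):
        self.max_retries = {State.SYN_SENT: max_syn_retries,
                            State.FIN_SENT: max_fin_retries}
        self.state = State.INIT
        self.retries = 0
        self.sent = []  # messages "transmitted", recorded for inspection

    def _enter(self, state, message=None):
        if message is not None:
            self.sent.append(message)
        self.state = state
        self.retries = 0

    def start(self):
        # INIT: send a SYN with the ACK flag on, then wait in SYN_SENT.
        if self.state is State.INIT:
            self._enter(State.SYN_SENT, "SYN")

    def on_timer(self):
        # SYN_SENT / FIN_SENT: retransmit until the retry budget is spent,
        # then give up and fall back to INIT.
        if self.state not in self.max_retries:
            return
        if self.retries < self.max_retries[self.state]:
            self.retries += 1
            self.sent.append("SYN" if self.state is State.SYN_SENT else "FIN")
        else:
            self._enter(State.INIT)

    def on_syn_ack(self):
        if self.state is State.SYN_SENT:
            self._enter(State.EST)

    def on_nack(self):
        if self.state is State.SYN_SENT:
            self._enter(State.INIT)

    def leave(self):
        # EST: the FE manager asks the FE to leave.
        if self.state is State.EST:
            self._enter(State.FIN_SENT, "FIN")

    def on_fin(self):
        # EST: the CE resets the FE; the FE sends a FIN and returns to INIT.
        if self.state is State.EST:
            self._enter(State.INIT, "FIN")

    def on_broadcast_syn(self):
        # EST: the CE restarted and no longer knows this FE.
        if self.state is State.EST:
            self._enter(State.INIT)

    def on_fin_ack(self):
        if self.state is State.FIN_SENT:
            self._enter(State.INIT)
```

Events that arrive in a state where they have no meaning are simply ignored in this sketch; a real implementation might log or count them instead.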

Additionally, not discussed here are optional heartbeats from the CE to FE. If the CE doesn't see heartbeats after a timeout period then the transition to the INIT state will be made.

FIN_SENT state:

Two events could happen:

1. The timer expires. If the number of retries has not reached the maximum allowed value then the FIN is retransmitted and the timer restarted. If the maximum number of retries has been reached with the last FIN transmission then the FE notifies the FE manager and goes into INIT state.

2. A valid FIN|ACK packet is received from the CE. Action: cancel the timer, inform the FE manager and go into INIT state.

8.3.2 CE view of FE State transitions

This is per FE information on the CE side.

[Figure: CE view of the FE state transition diagram. States: INIT, SYN_RCVD, EST, FIN_SENT. Transitions: INIT to itself on "wait for FE"; INIT to SYN_RCVD on "recvd SYN"; SYN_RCVD to EST on "setup complete"; EST to FIN_SENT on "Send FIN"; FIN_SENT to itself on "FIN retrans"; FIN_SENT to INIT on "recvd FIN|ACK"; EST to INIT on "recvd FIN|ACK or recvd SYN".]

INIT state:

When the CE Manager informs the CE of a FE, basic state information is created for the FE and it is placed into the INIT state. At this point the CE has been informed by the CE Manager of the following:

o the bundle the FE will join,

o its PID that the FE is going to use to refer to the CE,

o the unicast PID of the FE,

o the number of retries for the SYN transmission and the SYN timer,

o the number of retries for the FIN transmission and the FIN timer value,

o the expected timeouts before the FE joins and the number of such timeouts to wait for the FE,
o whether the FE is interested in restart information if available (refer to the FIN_SENT state).

The CE fires a timer waiting for the FE to join. Two things could happen:

1. The timer expires. If the number of retries for waiting for the FE to join has not reached the maximum allowed value then the timer is restarted. If the maximum number of retries is reached then the CE deletes the FE's state info and informs the manager.

2. A valid SYN packet is received from the FE. The CE transitions into the SYN_RCVD state.

SYN_RCVD state:

In this state the CE will do any necessary processing to prepare for the FE to be admitted into the NE. The CE issues a SYN|ACK and moves into the EST state.

EST state:

This is the established state where normal ForCES communication starts. Several events may force the CE to transition out of the EST state:

1. The CE manager requests it to. In this case the CE will issue a FIN with an ACK request to the FE and transition to the FIN_SENT state.

2. The FE leaves. This is considered a reset of the FE. The FE sends a FIN to the CE to inform it that it is leaving. The CE immediately sends a FIN ACK and notifies the CE manager. Transition is made to the INIT state.

Not discussed here is the use of heartbeats or other events (e.g., link down) to transition to the INIT state on discovery that the FE is dead.

FIN_SENT state:

The CE fires the FIN timer and waits for a response from the FE. Two events could happen:

1. The timer expires. If the number of retries has not reached the maximum allowed value then the FIN is retransmitted and the timer restarted. If the maximum number of retries has been reached with the last FIN transmission then the CE notifies the CE manager and goes into INIT state.

2. A valid FIN|ACK packet is received from the FE:
   * cancel the timer, inform the CE manager

   * transition to the INIT state.

For states that transition to the INIT state, observe that if the FE comes back and joins before the FE expiry time, its LFB state(s) would still be intact and may be resent to it (the restart policy is agreed on at pre-association time). On the other hand, the state will be garbage collected if no SYNs from the FE are seen within the period (or if new ones are seen but the FEM-CEM interface indicates no interest in the restart data).

8.3.3 SYN Message Format

A SYN message contains a base Netlink2 header (refer to Section 5.1) with the appropriate flags followed by the Extension TLV Name ID (refer to Section 8.1.5). The Name ID will have the name the FE wishes to be referred to as.

8.3.4 FIN Message Format

A FIN message contains a base Netlink2 header with the appropriate flags (refer to Section 5.1).

8.3.5 NOOP Message Format

A NOOP message contains a base Netlink2 header with the appropriate flags (refer to Section 5.1) set. The NOOP carries no execution message and therefore no operations on LFBs are carried out as a result of receiving it. The flags of the message are still relevant.

A standard use of the NOOP message is for heartbeats. A CE may send LFBs keepalive messages using the NLMSG_NOOP command. When requesting replies, the CE sets the NLM_F_ECHO flag to get the message sent back to it as is (essentially a loopback of the exact same message sans the ECHO flag).

8.4 LFB and FE Service Templates

In this section we describe Service Templates used to configure FEs and LFBs as well as for async event notification as required by the ForCES WG charter. Some of these message templates are already described in the Netlink document ([RFC3549]) but are repeated here for clarity.

   A feature of Netlink2 is that the same message template is used for
   configuration, querying, and events.  In the CE->FE direction,
   configuration commands embedding the Service Templates described in
   this section are used to configure the FE (to add or delete a
   policy, for example).  In the FE->CE direction, the templates are
   used to return query responses or to send events to the CE (on a
   per-LFB basis).

   As noted earlier, a single Netlink2 message may carry multiple
   service templates if the NLM_F_MS flag is set.  This is not
   restricted to configuration (CE->FE) but also extends to responses
   and events (FE->CE).

8.4.1 Physical Port and Address Functions

   [TBF]

8.4.1.1 Interface Service Template

   This is very close to what the Port LFB is defined to be in the
   Model draft.  Its expressive semantics are sufficient to define a
   physical port (regardless of the underlying physical links), a
   virtual interface, etc.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |   Reserved    |          Device Type          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Interface Index                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Device Flags                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Change Mask                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
      This is always set to AF_UNSPEC.

   Device Type: 16 bits
      This defines the type of the link.  The link could be Ethernet,
      PCI, a tunnel, etc.

   Interface Index: 32 bits
      Uniquely identifies the interface.

   Device Flags: 32 bits

      IFF_UP           Interface is administratively up.
      IFF_BROADCAST    Valid broadcast address set.
      IFF_DEBUG        Internal debugging flag.
      IFF_LOOPBACK     Interface is a loopback interface.
      IFF_POINTOPOINT  Interface is a point-to-point link.
      IFF_RUNNING      Interface is operationally up.
      IFF_NOARP        No ARP protocol needed for this interface.
      IFF_PROMISC      Interface is in promiscuous mode.
      IFF_NOTRAILERS   Avoid use of trailers.
      IFF_ALLMULTI     Receive all multicast packets.
      IFF_MASTER       Master of a load balancing bundle.
      IFF_SLAVE        Slave of a load balancing bundle.
      IFF_MULTICAST    Supports multicast.
      IFF_PORTSEL      Is able to select media type via ifmap.
      IFF_AUTOMEDIA    Auto media selection active.
      IFF_DYNAMIC      Interface was dynamically created.

   Change Mask: 32 bits
      Reserved for future use.  Must be set to 0xFFFFFFFF.

   Applicable attributes:

      IFLA_UNSPEC      Unspecified.
      IFLA_ADDRESS     Interface L2 (hardware) address.
      IFLA_BROADCAST   L2 (hardware) broadcast address.
      IFLA_IFNAME      ASCII string device name.
      IFLA_MTU         MTU of the device.
      IFLA_LINK        ifindex of link to which this device is bound.
      IFLA_QDISC       ASCII string defining egress root queuing
                       discipline.
      IFLA_STATS       Interface statistics.

   Netlink message types specific to this service: RTM_NEWLINK,
   RTM_DELLINK, and RTM_GETLINK.

8.4.1.2 Address Service Template

   The expressive semantics of this template are sufficient to define
   addressing for a port LFB (physical or virtual interfaces),
   including secondary addresses.  The template could also be used to
   configure other protocols such as IPX, but we focus only on IPv4
   and IPv6 here.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |    Length     |     Flags     |     Scope     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Interface Index                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
      Address family: AF_INET for IPv4 and AF_INET6 for IPv6.

   Length: 8 bits
      The length of the address mask.

   Flags: 8 bits

      IFA_F_SECONDARY   For secondary address (alias interface).
      IFA_F_PERMANENT   For a permanent address set by the user.  When
                        this is not set, the address was dynamically
                        created (e.g., by stateless autoconfiguration).
      IFA_F_DEPRECATED  Defines deprecated (IPv6) address.
      IFA_F_TENTATIVE   Defines tentative (IPv6) address (duplicate
                        address detection is still in progress).

   Scope: 8 bits
      The address scope in which the address stays valid.

      SCOPE_UNIVERSE    Global scope.
      SCOPE_SITE        (IPv6 only) Only valid within this site.
      SCOPE_LINK        Valid only on this device.
      SCOPE_HOST        Valid only on this host.

   Applicable attributes:

      IFA_UNSPEC        Unspecified.
      IFA_ADDRESS       Raw protocol address of interface.
      IFA_LOCAL         Raw protocol local address.
      IFA_LABEL         ASCII string name of the interface.
      IFA_BROADCAST     Raw protocol broadcast address.
      IFA_ANYCAST       Raw protocol anycast address.
      IFA_CACHEINFO     Cache address information.

   Netlink messages specific to this service: RTM_NEWADDR, RTM_DELADDR,
   and RTM_GETADDR.

8.4.2 IPv4 and IPv6 L3 Forwarding Functions

   In this section we describe two LFB templates necessary for IPv4 and
   IPv6 L3 forwarding control.

8.4.2.1 IPv4 and IPv6 Forwarding LFB Template

   The expressive semantics of this template are sufficient to describe
   any IPv4 or IPv6 route configuration, including the ability to
   express route entries for virtual routers within a physical router.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |  Src length   |  Dest length  |      TOS      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Table ID    |   Protocol    |     Scope     |     Type      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             Flags                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
      Address family: AF_INET for IPv4 and AF_INET6 for IPv6.

   Src length: 8 bits
      Prefix length of source IP address.

   Dest length: 8 bits
      Prefix length of destination IP address.

   TOS: 8 bits
      The 8-bit TOS (should be deprecated to make room for DSCP).

   Table ID: 8 bits
      Table identifier.  Up to 255 route tables are supported.

      RT_TABLE_UNSPEC    An unspecified routing table.
      RT_TABLE_DEFAULT   The default table.
      RT_TABLE_MAIN      The main table.
      RT_TABLE_LOCAL     The local table.

      The user may assign arbitrary values between RT_TABLE_UNSPEC(0)
      and RT_TABLE_DEFAULT(253).

   Protocol: 8 bits
      Identifies what/who added the route.

      Protocol           Route origin
      ..............................................
      RTPROT_UNSPEC      Unknown.
      RTPROT_REDIRECT    By an ICMP redirect.
      RTPROT_KERNEL      By the kernel.
      RTPROT_BOOT        During bootup.
      RTPROT_STATIC      By the administrator.

      Values larger than RTPROT_STATIC(4) are not interpreted by the
      kernel; they are just for user information.  They may be used to
      tag the source of routing information or to distinguish between
      multiple routing daemons.

   Scope: 8 bits
      Route scope (valid distance to destination).

      RT_SCOPE_UNIVERSE  Global route.
      RT_SCOPE_SITE      Interior route in the local autonomous system.
      RT_SCOPE_LINK      Route on this link.
      RT_SCOPE_HOST      Route on the local host.
      RT_SCOPE_NOWHERE   Destination does not exist.

      The values between RT_SCOPE_UNIVERSE(0) and RT_SCOPE_SITE(200)
      are available to the user.

   Type: 8 bits
      The type of route.

      Route type         Description
      ----------------------------------------------------
      RTN_UNSPEC         Unknown route.
      RTN_UNICAST        A gateway or direct route.
      RTN_LOCAL          A local interface route.
      RTN_BROADCAST      A local broadcast route (sent as a broadcast).
      RTN_ANYCAST        An anycast route.
      RTN_MULTICAST      A multicast route.
      RTN_BLACKHOLE      A silent packet dropping route.
      RTN_UNREACHABLE    An unreachable destination.  Packets are
                         dropped and host unreachable ICMPs are sent
                         to the
                         originator.
      RTN_PROHIBIT       A packet rejection route.  Packets are dropped
                         and communication prohibited ICMPs are sent to
                         the originator.
      RTN_THROW          When used with policy routing, continue routing
                         lookup in another table.  Under normal routing,
                         packets are dropped and net unreachable ICMPs
                         are sent to the originator.
      RTN_NAT            A network address translation rule.
      RTN_XRESOLVE       Refer to an external resolver (not
                         implemented).

   Flags: 32 bits
      Further qualify the route.

      RTM_F_NOTIFY       If the route changes, notify the user.
      RTM_F_CLONED       Route is cloned from another route.
      RTM_F_EQUALIZE     Allow randomization of next hop path in
                         multi-path routing (currently not implemented).

   Attributes applicable to this service:

      Attribute          Description
      ---------------------------------------------------
      RTA_UNSPEC         Ignored.
      RTA_DST            Protocol address for route destination address.
      RTA_SRC            Protocol address for route source address.
      RTA_IIF            Input interface index.
      RTA_OIF            Output interface index.
      RTA_GATEWAY        Protocol address for the gateway of the route.
      RTA_PRIORITY       Priority of route.
      RTA_PREFSRC        Preferred source address in cases where more
                         than one source address could be used.
      RTA_METRICS        Route metrics attributed to route and
                         associated protocols (e.g., RTT, initial TCP
                         window, etc.).
      RTA_MULTIPATH      Multipath route next hop's attributes.
      RTA_PROTOINFO      Firewall based policy routing attribute.
      RTA_FLOW           Route realm.
      RTA_CACHEINFO      Cached route information.

   Additional Netlink message types applicable to this service:
   RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.

8.4.2.2 Neighbor Discovery LFB Template

   The expressive semantics of this template are sufficient to describe
   both IPv4 neighbor resolution via ARP and IPv6 neighbor discovery
   (RFC2461).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |   Reserved1   |           Reserved2           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Interface Index                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             State             |     Flags     |     Type      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
      Address family: AF_INET for IPv4 and AF_INET6 for IPv6.

   Interface Index: 32 bits
      The unique interface index.

   State: 16 bits
      A bitmask of the following states:

      NUD_INCOMPLETE   Still attempting to resolve.
      NUD_REACHABLE    A confirmed working cache entry.
      NUD_STALE        An expired cache entry.
      NUD_DELAY        Neighbor no longer reachable.  Traffic sent,
                       waiting for confirmation.
      NUD_PROBE        A cache entry that is currently being
                       re-solicited.
      NUD_FAILED       An invalid cache entry.
      NUD_NOARP        A device which does not do neighbor discovery
                       (ARP).
      NUD_PERMANENT    A static entry.

   Flags: 8 bits

      NTF_PROXY        A proxy ARP entry.
      NTF_ROUTER       An IPv6 router.

   Attributes applicable to this service:

      NDA_UNSPEC       Unknown type.
      NDA_DST          A neighbor cache network layer destination
                       address.
      NDA_LLADDR       A neighbor cache link layer address.
      NDA_CACHEINFO    Cache statistics.

   Additional Netlink message types applicable to this service:
   RTM_NEWNEIGH, RTM_DELNEIGH, and RTM_GETNEIGH.

8.4.3 Filtering Functions

   TBF

8.4.4 QoS Functions

   TBF

8.4.5 IPSEC Functions

   TBF

8.4.6 Packet redirection Functions

   TBF

8.4.7 Packet Mirroring Functions

   TBF

8.4.8 Packet Sampling Functions

   TBF

8.5 Security Considerations

   CEs may communicate vital and possibly confidential information to
   FEs via the ForCES protocol.  Examples of such information are
   filtering rules and secret encryption keys.  In addition, the ForCES
   protocol should not open new possibilities for Denial of Service
   attacks.

   In a single-box environment, the interconnect between CEs and FEs
   can be physically secured.  ForCES messages arriving on physical
   ports that are not part of the interconnect are dropped.  In such an
   environment, protection is required only against data-packet-based
   DoS attacks.

   A multi-hop environment places more requirements in terms of
   security.  Protection against Netlink2 SYN-flood attacks becomes
   necessary.  In addition, some or all of the ForCES messages may have
   to be authenticated or encrypted.

8.5.1 Denial of Service (DoS) attacks

   DoS attacks resulting from data packets redirected by the FE to the
   CE can be prevented by shaping that traffic according to
   configurable parameters such as a maximum rate.  An FE that is
   resistant to data-packet DoS attacks MUST therefore support the LFBs
   needed to place policers that shape the traffic it redirects to the
   CE.

   Preventing DoS attacks at the ForCES protocol level (such as a
   Netlink2 SYN flood) may be necessary if the underlying transport
   protocol is not resistant to such attacks.  This can be the case if
   UDP is used, for instance.  In the case of TCP and SCTP,
   cookie-based mechanisms already exist to prevent SYN flood DoS
   attacks (refer to the respective RFCs and [TCP-SYN-COOKIES]).  A
   SYN-flood DoS-resistant FE or CE MUST therefore support a
   Netlink2-Extension Cookie TLV (TLV_TYPE = NL2_COOKIE).  This Cookie
   TLV is placed in the ACK message that acknowledges a SYN message.
   This Cookie TLV MUST be returned as is in the SYNACK message.
   (Note: the content and length of the Cookie TLV remain to be
   standardized, if necessary.)

8.5.2 Authentication and Encryption

   To perform authentication, the necessary information, such as shared
   secrets or public and private keys, may be configured statically.
   In a dynamic environment, on the other hand, public keys may have to
   be distributed using certificates.
   Such certificates must contain names that are uniquely and
   permanently assigned to CEs and FEs.  Addresses used for routing
   ForCES messages may change and are not suitable for that purpose.
   ForCES qualified names (Note: these need to be defined in a draft of
   their own) MUST be used, similarly to iSCSI qualified names
   [iSCSI-NAMING].

References

   [Diffserv]  "Linux Diffserv".

   [ForCES_Model]
              Yang, L., Halpern, J., Gopal, R., DeKok, A., Haraszti, Z.
              and S. Blake, "ForCES Forwarding Element Model", October
              2003.

   [ForCES_REQ]
              Khosravi, H. and T. Anderson, "Requirements for
              Separation of IP Control and Forwarding", October 2003.

   [Goutaudier]
              Goutaudier, G., "Enhancements and Prototype
              Implementation of the ForCES Netlink2 Protocol", IBM
              Research Report RZ3482, September 2003.

   [Netfilter]
              "Linux Netfilter".

   [RFC1157]  Case, J., Fedor, M., Schoffstall, M. and C. Davin,
              "Simple Network Management Protocol (SNMP)", RFC 1157,
              May 1990.

   [RFC1633]  Braden, R., Clark, D. and S. Shenker, "Integrated
              Services in the Internet Architecture: an Overview", RFC
              1633, June 1994.

   [RFC1812]  Baker, F., "Requirements for IP Version 4 Routers", RFC
              1812, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2328]  Moy, J., "OSPF Version 2", RFC 2328, April 1998.

   [RFC2475]  Blake, S., Black, D., Carlson, M., Davies, E., Weiss, W.
              and Z. Wang, "An Architecture for Differentiated
              Services", RFC 2475, December 1998.

   [RFC2748]  Boyle, J., Cohen, R., Durham, D., Herzog, S., Rajan, R.
              and A. Sastry, "The COPS (Common Open Policy Service)
              Protocol", RFC 2748, January 2000.

   [RFC2844]  Przygienda, T., Droz, P. and R. Haas, "OSPF over ATM and
              Proxy-PAR", RFC 2844, May 2000.

   [RFC3036]  Andersson, L., Doolan, P., Feldman, N., Fredette, A. and B.
              Thomas, "LDP Specification", RFC 3036, January 2001.

   [RFC3292]  Doria, A., "General Switch Management Protocol (GSMP)
              V3", RFC 3292, June 2002.

   [RFC3358]  Przygienda, T., "Optional Checksums in Intermediate
              System to Intermediate System (ISIS)", RFC 3358, August
              2002.

   [RFC3549]  Hadi Salim, J., Khosravi, H., Kleen, A. and A. Kuznetsov,
              "Linux Netlink as an IP Services Protocol", RFC 3549,
              July 2003.

   [Stevens]  Wright, G. and W. Stevens, "TCP/IP Illustrated Volume 2,
              Chapter 20", June 1995.

   [TCP-SYN-COOKIES]
              Bernstein, D., "SYN cookies", 1997.

   [XTP]      "Xpress Transport Protocol Specification, XTP Revision
              4.0", March 1995.

   [iSCSI-NAMING]
              "iSCSI Naming and Discovery",
              draft-ietf-ips-iscsi-name-disc-10.txt, June 2003.

Authors' Addresses

   Jamal Hadi Salim
   Znyx Networks
   195 Stafford Rd. West
   Ottawa, Ontario
   Canada

   EMail: hadi@znyx.com

   Robert Haas
   IBM Research
   Zurich Research Laboratory
   Saeumerstrasse 4
   CH-8803 Rueschlikon, Switzerland

   EMail: rha@zurich.ibm.com

   Steven Blake
   Ericsson
   920 Main Campus Drive, Suite 500
   Raleigh, NC 27606
   USA

   EMail: steven.blake@ericsson.com

Appendix A. Sample Service Hierarchy

   In the diagram below we show a simple IP service, foo, and the
   interaction between its CP and FE components (labels 1-3).

   The diagram is also used to demonstrate CP<->FE addressing.  In this
   section we illustrate only the addressing semantics.  In Appendix B,
   the diagram is referenced again to define the protocol interaction
   between service foo's CPC and LFB (labels 4-10).

      CP
   [--------------------------------------------------------.
   |   .-----.                                              |
   |  |       \            . --------.                      |
   |  | CLI   |           /           \                     |
   |  |       |          | CP protocol |                    |
   |   \     /->>  -.    | component   |      <-.           |
   |    \__ _/      |    |    For      |        |           |
   |                |    | IP service  |        ^           |
   |                Y    |    foo      |        |           |
   |                |     \___________/         ^           |
   |                Y   1,4,6,8,9 /  ^ 2,5,10   | 3,7       |
    --------------- Y------------/---|----------|-----------
                    |           ^    |          ^
                  **|***********|****|**********|**********
                  ************* Netlink2 layer ************
                  **|***********|****|**********|**********
      FE            |           |    ^          ^
          .-------- Y-----------Y----|--------- |----.
          |         \                |       /       |
          |          \               Y      /        |
          |          .\ --------^-------.  /         |
          |          |FE component/module|/          |
          |          |  for IP Service   |           |
   --->---|------>---|     foo           |----->-----|------>--
          |           -------------------            |
          |                                          |
           ------------------------------------------

   The control plane protocol for IP service foo does the following to
   connect to its FE counterpart.  The steps below are also numbered in
   the diagram above.

   1.  Connect to IP service foo through a socket connect.  A typical
       connection would be via a call to: socket(AF_NETLINK, SOCK_RAW,
       NETLINK_FOO)

   2.  Bind to listen to specific async events for service foo

   3.  Bind to listen to specific async FE events

   Note that a wrapper socket can be created on top of the real
   sockets: depending on the destination PID given, it chooses the most
   appropriate socket on which to send the packet (if there are two
   multicast groups, one for all FEs and one for all FEs and CEs, a
   packet from the CE to the FEs will use the first multicast group).
   The wrapper socket basically maps a message to the most appropriate
   wire in the bundle.

Appendix B. Sample Protocol for the foo IP Service

   Our proverbial IP service "foo" is used again to demonstrate how one
   can deploy a simple IP service control using Netlink2.

   These steps are continued from Appendix A (hence the numbering).

   4.  query for current config of FE component

   5.  receive response to (4) via channel on (3)

   6.  query for current state of IP service foo

   7.  receive response to (6) via channel on (2)

   8.  register the protocol-specific packets you would like the FE to
       forward to you

   9.  send specific service foo commands and receive responses for
       them, if needed

B.1 Interacting with Other IP Services

   The diagram in Appendix A shows another control component
   configuring the same service.  In this case, it is a proprietary
   Command Line Interface.  The CLI may or may not be using the Netlink
   protocol to communicate with the foo component.  If the CLI issues
   commands that affect the policy of the LFB for service "foo", then
   the "foo" CPC is notified.  It can then make algorithmic decisions
   based on this input.  For example, if an FE allowed one service to
   delete policies installed by a different service, and a policy that
   foo installed was deleted by service bar, there might be a need to
   propagate this to all the peers of service "foo".

Appendix C. Examples

   In this example we show a simple configuration Netlink2 message sent
   from a TC CPC to an egress TC FIFO queue.  This queue algorithm is
   based on packet counting and drops packets when the queue length
   exceeds the configured limit (100 packets in the example policy
   below).  We assume the queue is in a hierarchical setup with a
   parent of 100:0 and a classid of 100:1, and that it is to be
   installed on the device with an ifindex of 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |    Flags_E    |            Length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Type (RTM_NEWQDISC)      |      Flags (NLM_F_EXCL |      |
   |                               |NLM_F_CREATE | NLM_F_REQUEST)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              Sequence Number (arbitrary number)               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Source PID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Destination PID                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Type == NL2_SERVICE      |         Outer Length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Type == NL2_QDISC       |         Inner Length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Family(AF_INET)|   Reserved1   |           Reserved2           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Interface Index (4)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Qdisc handle (0x1000001)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Parent Qdisc (0x1000000)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         TCM Info (0)                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Type (TCA_KIND)        |           Length(4)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Value ("pfifo")                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Type (TCA_OPTIONS)       |           Length(4)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Value (limit=100)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
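
   As a rough illustration of where the Qdisc handle and Parent Qdisc
   values in the message above come from, the sketch below converts a
   "major:minor" identifier into a 32-bit value (major in the upper 16
   bits, minor in the lower 16 bits, both read as hexadecimal as in the
   Linux tc convention) and packs the fixed fields of the qdisc service
   template.  The function names are hypothetical, and the use of
   network byte order is an assumption made for the example; neither is
   defined by this document.

```python
import socket
import struct


def tc_handle(text):
    """Convert a tc-style "major:minor" string (hex digits) to a 32-bit handle."""
    major, minor = text.split(":")
    return (int(major, 16) << 16) | int(minor or "0", 16)


def pack_qdisc_template(ifindex, handle, parent, info=0):
    """Pack family, two reserved fields, ifindex, handle, parent and info.

    Assumption: network byte order ("!") is used; the document does not
    state the byte order of these fields.
    """
    return struct.pack("!BBHiIII", socket.AF_INET, 0, 0,
                       ifindex, handle, parent, info)


handle = tc_handle("100:1")   # classid 100:1
parent = tc_handle("100:0")   # parent 100:0
body = pack_qdisc_template(4, handle, parent)
print(hex(handle), hex(parent), len(body))   # 0x1000001 0x1000000 20
```

   The 20 bytes produced correspond to the five 32-bit rows from
   Family/Reserved through TCM Info in the diagram; the TCA_KIND and
   TCA_OPTIONS attributes would follow as TLVs.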

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.  Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11.  Copies of
   claims of rights made available for publication and any assurances
   of licenses to be made available, or the result of an attempt made
   to obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification
   can be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.