Network Working Group                                    R. R. Stewart
INTERNET-DRAFT                                                   Cisco
                                                                Q. Xie
                                                            L. Yarroll
                                                              Motorola
                                                               J. Wood
                                                               K. Poon
                                                      Sun Microsystems

expires in six months                                    July 13, 2000

                         SCTP Sockets Mapping

Status of This Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of [RFC2026]. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document describes a mapping of the Stream Control Transmission Protocol (SCTP) [SCTP] into a sockets API. The benefits of this mapping include compatibility for TCP applications, access to new SCTP features and a consolidated error and event notification scheme.

1. Introduction

This document describes a mapping of the Stream Control Transmission Protocol (SCTP) [SCTP] into a sockets API. The benefits of this mapping include compatibility for TCP applications, access to new SCTP features and a consolidated error and event notification scheme.

The sockets API has provided a standard mapping of the Internet Protocol suite to many operating systems. Both TCP [TCP] and UDP [UDP] have benefited from this standard representation and access method across many diverse platforms. SCTP is a new protocol that provides many of the characteristics of TCP but also tries to incorporate semantics more akin to UDP. This document attempts to define a methodology to map the existing sockets API for use with SCTP, providing a base for access to new features as well as compatibility, so that many existing TCP applications can be migrated to SCTP with few (if any) changes.

There are three basic design goals:

1. Define a sockets mapping for SCTP that is consistent with other protocol mappings (for instance, UDP, TCP, IPv4, and IPv6) to the API.

2. The mapping should provide the same semantics as sockets for connection-oriented protocols, such as TCP, so that existing applications for these protocols can be ported to use SCTP with very little effort, and developers familiar with those semantics can easily adapt to SCTP. At the same time, the mapping should provide mechanisms to exploit new features of SCTP.

3. Provide new semantics that map more closely to SCTP. These semantics are similar to those defined for connectionless protocols, such as UDP. Note that SCTP is connection-oriented in nature. It does not support broadcast or multicast communications, as UDP does.

Goals two and three are not compatible, so this document defines two modes of mapping. They share some common structures but provide two different programming models. Section 3 defines data structures common to both modes, and section 6 defines calls and operations common to both. Section 4 defines the connectionless mode. Section 5 defines the connection-oriented mode.

2. Conventions

The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when they appear in this document, are to be interpreted as described in RFC 2119 [RFC2119].

3. Data Structures

The primary sockets interface methods used by SCTP are the sendmsg() and recvmsg() system calls. These calls have the following format:

   ssize_t recvmsg(int socket, struct msghdr *message, int flags);
   ssize_t sendmsg(int socket, const struct msghdr *message, int flags);

socket: 32 bits (signed integer)

This is the integer value returned from the socket() call that established this endpoint.

For a definition of msghdr please refer to [STEVENS].

Scatter/gather buffers, or I/O vectors (the msg_iov field in the msghdr structure), are treated as a single SCTP data chunk, rather than multiple chunks, for both sendmsg() and recvmsg().
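The scatter/gather rule above can be sketched as follows. This is a minimal illustration and not part of the mapping itself: the helper names build_two_part_msg() and msg_total_len() are hypothetical, and only the standard msghdr and iovec fields are used. Two separate buffers are attached to one msghdr, and the sum of their lengths is the size of the single message that sendmsg() would hand to SCTP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: point one msghdr at two separate buffers. */
void build_two_part_msg(struct msghdr *mh, struct iovec iov[2],
                        void *hdr, size_t hdr_len,
                        void *body, size_t body_len)
{
    assert(mh != NULL && iov != NULL);
    memset(mh, 0, sizeof(*mh));      /* no address, no ancillary data */
    iov[0].iov_base = hdr;
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = body;
    iov[1].iov_len  = body_len;
    mh->msg_iov    = iov;
    mh->msg_iovlen = 2;
}

/* Hypothetical helper: total bytes described by the I/O vector, that
 * is, the size of the one user message the buffers form together. */
size_t msg_total_len(const struct msghdr *mh)
{
    size_t total = 0;
    assert(mh != NULL);
    for (size_t i = 0; i < (size_t)mh->msg_iovlen; i++)
        total += mh->msg_iov[i].iov_len;
    return total;
}
```

A caller would then pass the filled msghdr to sendmsg(); under the rule above, the two buffers would travel as one user message rather than two.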
The flags within the msghdr structure are used to communicate one of the following:

SYNCHRONOUS - This flag indicates that the application wishes the call to be blocking. If the association is not up, the function will block until it has been created.

A_SYNCHRONOUS_IO - This flag indicates that the application wishes the function call to be non-blocking and to return immediately.

3.1 The SCTP_msgcontrol structure

A key element of all SCTP extensions is the ability to specify various command options on each send or to find out specific information from each receive. The msg_control structure is the primary mechanism for this communication and is defined in the form of cmsghdrs. Each cmsghdr contains a data portion (see [STEVENS] appendix A for details of their usage). The following subsections dictate the allowable structures that may be passed in a cmsghdr to/from the SCTP protocol stack.

Note: when processing the SCTP ancillary data structures defined here, it is also possible to receive other ancillary data (e.g. IPv6 structures). An application should be prepared to handle other types of data besides those passed by SCTP.

3.1.1 The INIT structure

The following details the structure used when implicitly starting a new association with a sendmsg() operation.

   struct sctp_initmsg {
       uint16_t sinit_num_ostreams;
       uint16_t sinit_max_instreams;
       uint16_t sinit_max_attempts;
       uint16_t sinit_max_init_timeo;
   };

sinit_num_ostreams: 16 bits (unsigned integer)

This is an integer number representing the number of streams that the application wishes to be able to send to. This number is confirmed in the COMMUNICATION_UP notification and must be verified, since it is a number negotiated with the remote endpoint. The default value of 0 indicates to use the endpoint default value.

sinit_max_instreams: 16 bits (unsigned integer)

This value represents the maximum number of streams the application is prepared to support. This value is bounded by the actual implementation.
In other words, the user MAY be able to support more streams than the Operating System. In such a case, the Operating System limit will override the value requested by the user. The default value of 0 indicates to use the endpoint's default value.

sinit_max_attempts: 16 bits (unsigned integer)

This integer is used to specify how many attempts the SCTP endpoint should make at resending the INIT when no response is obtained. This value overrides the system SCTP 'Max.Init.Retransmits' value. The default value of 0 indicates to use the endpoint's default value, normally set to the system's default 'Max.Init.Retransmits' value.

sinit_max_init_timeo: 16 bits (unsigned integer)

This value represents the largest time-out or RTO value to use in attempting an INIT. Normally 'RTO.Max' is used to limit the doubling of the RTO upon time-out. For the INIT message this value MAY override 'RTO.Max'. This value MUST NOT influence 'RTO.Max' during data transmission and is only used to bound the initial setup time an application is willing to wait. A default value of 0 indicates to use the endpoint's default value, normally set to the system's 'RTO.Max' value (60 seconds).

3.1.2 The SND_RCV structure

Whenever a datagram is sent, specific options may need to be specified by the sending application. To do this, the SND_RCV structure is filled in with the requested characteristics. The structure has the following format:

   struct sctp_sndrcvinfo {
       uint32_t sinfo_tsn;
       uint16_t sinfo_stream;
       uint16_t sinfo_ssn;
       uint16_t sinfo_delivery_num;
       uint16_t sinfo_flags;
       uint32_t sinfo_ppid;
       uint32_t sinfo_context;
       uint8_t  sinfo_tos;
   };

sinfo_tsn: 32 bits (unsigned integer)

For the recvmsg() call this value will contain the TSN that the remote endpoint placed in the DATA chunk. For fragmented messages it is implementation dependent which TSN appears in this location. The sendmsg() call will ignore this parameter.

sinfo_stream: 16 bits (unsigned integer)

For the recvmsg() call this value will contain the stream number that this message was sent to. For the sendmsg() call this value will hold the stream number that the application wishes to send this message to. If a sender specifies an invalid stream number, an error indication will be returned and the call will fail.

sinfo_ssn: 16 bits (unsigned integer)

For the recvmsg() call this value will contain the stream sequence number that the remote endpoint placed in the DATA chunk. For fragmented messages this will be the same number for all deliveries of the message (if more than one recvmsg() call is needed to read the message). The sendmsg() call will ignore this parameter.

sinfo_delivery_num: 16 bits (unsigned integer)

This value holds the delivery number used by the partial delivery mechanism. In some cases the SCTP endpoint will need to deliver a large message in pieces. When this occurs, the delivery number will be incremented with each subsequent delivery. The delivery number is set to 0 on the first delivery. The sendmsg() call ignores this field.

sinfo_ppid: 32 bits (unsigned integer)

In the sendmsg() call this is an opaque unsigned value that is passed to the remote end in each user message. In the recvmsg() call this value is the same information that was passed by the upper layer in the peer application. Please note that byte order issues are NOT accounted for and this information is passed opaquely by the SCTP stack from one end to the other.

sinfo_context: 32 bits (unsigned integer)

This value is opaque 32-bit context information that is used in the sendmsg() function. This value will be passed back to the upper layer if an error occurs on the send of a message and will be retrieved with each unsent message.

sinfo_flags: 16 bits (unsigned integer)

This field may contain any of the following flags and is composed of a bitwise OR of these values.

recvmsg() flags:

MORE_DATA - This flag is present if a subsequent delivery will follow. Subsequent recvmsg() calls will retrieve further piece(s) of the message.

MSG_EOR - This flag is present in the last piece of a message.

UNORDERED - This flag is present when the message was sent with the unordered flag.

sendmsg() flags:

UNORDERED - This flag is present to request unordered delivery of the message. If this flag is not present, the datagram is considered to be an ordered send.

ABORT - Setting this flag causes the specified association to be aborted by sending an ABORT message to the peer.

SHUTDOWN - Setting this flag invokes the SCTP graceful shutdown procedures, which ensure that all data queued by both endpoints is successfully transmitted before closing the association.

sinfo_tos: 8 bits (unsigned integer)

This field is available to change the TOS value in the outbound IP packet. The default value of this field is 0. Note that only 6 bits of this byte are used; the upper 2 bits are not part of the TOS field, and any setting within these upper 2 bits is ignored.

3.2 Events

An SCTP application will need to be able to understand and process a number of events and errors that happen on the SCTP stack. These events include network status changes, association startups, and undeliverable datagrams. All of these are essential for the application to process.

When an SCTP application does a recvmsg(), most often the message read will be a data message from a peer endpoint. However, when the SCTP stack wishes to communicate an event notification to the application, it will set msg_flags in the msghdr to IS_EVENT. The msg_control structure will be overwritten with a cmsghdr structure that defines what type of event is being communicated. The data portion of the msghdr, i.e. the msg_iov, will contain the information communicated with the event or error.
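The receive-side check described above might be sketched as follows. This is only an illustration under stated assumptions: the draft gives no numeric value for IS_EVENT, so DRAFT_IS_EVENT below is a placeholder; the event type is assumed to be carried in cmsg_type, which the text does not state explicitly; the DRAFT_ names and draft_event_type() are hypothetical, with the numeric event values taken from the table in section 3.2.2.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_SCTP
#define IPPROTO_SCTP 132            /* IANA protocol number for SCTP */
#endif

#define DRAFT_IS_EVENT 0x8000       /* placeholder value, an assumption */

enum {                              /* values from the 3.2.2 table */
    DRAFT_SCTP_DATA_IO_EVENT = 1,
    DRAFT_SCTP_ASSOC_CHANGE  = 2,
    DRAFT_SCTP_INTF_CHANGE   = 3,
    DRAFT_SCTP_SEND_FAILED   = 4,
    DRAFT_SCTP_REMOTE_ERROR  = 5
};

/* Classify a msghdr filled in by recvmsg(). Returns the data-event
 * value for ordinary user data, the assumed cmsg_type of the first
 * IPPROTO_SCTP control header for an event, or -1 when the event
 * flag is set but no SCTP control header is present. */
int draft_event_type(struct msghdr *mh)
{
    assert(mh != NULL);
    if ((mh->msg_flags & DRAFT_IS_EVENT) == 0)
        return DRAFT_SCTP_DATA_IO_EVENT;

    /* Other ancillary data (e.g. IPv6) may be present; skip it. */
    for (struct cmsghdr *c = CMSG_FIRSTHDR(mh); c != NULL;
         c = CMSG_NXTHDR(mh, c)) {
        if (c->cmsg_level == IPPROTO_SCTP)
            return c->cmsg_type;
    }
    return -1;
}
```

An application loop would call such a helper after each recvmsg() and then interpret the msg_iov contents according to the returned type.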

3.2.1 The cmsghdr structure

For a definition of the cmsghdr structure and its use please refer to [STEVENS]. For SCTP the cmsg_level field will be encoded with IPPROTO_SCTP.

3.2.2 SCTP Notification and Event types

The following table illustrates the SCTP notification and event types.

   Name                Description                            Value
   ------------------  -------------------------------------  -----
   SCTP_DATA_IO_EVENT  This indicates a normal User Data        1
                       message is being sent and/or
                       received. Special options and
                       directions are also included in the
                       SCTP_msgcontrol structure as defined
                       in section 3.1.

   SCTP_ASSOC_CHANGE   This indication is passed to             2
                       indicate that an association has
                       either been opened or closed. The
                       data found in the msg_iov section is
                       detailed in 3.2.2.1.

   SCTP_INTF_CHANGE    This indication is passed to             3
                       indicate that an address that is
                       part of an existing association has
                       experienced a change of state (i.e.
                       a failure or return to service of
                       the reachability of an endpoint via
                       a specific transport address).
                       Please see 3.2.2.2 for data
                       structure details.

   SCTP_SEND_FAILED    The attached datagram (in the            4
                       msg_iov area) could not be sent to
                       the remote endpoint. This structure
                       includes the original
                       SCTP_DATA_IO_EVENT that was used in
                       sending this message, i.e. this
                       structure uses the SCTP_msgcontrol
                       per section 3.1.

   SCTP_REMOTE_ERROR   The attached error message (in the       5
                       msg_iov area) is an Operational
                       Error received from the remote peer.
                       It includes the complete TLV sent by
                       the remote endpoint. See section
                       3.2.2.3 for the detailed format.

3.2.2.1 Communication notifications

Communication notifications are used to inform the ULP that an SCTP association has either begun or ended. The notification information is passed in the msg_iov data and has the following format:

   struct commNotification {
       struct sockaddr_storage primaryAddr;
       int state;
       int errorCode;
   };

primaryAddr: variable based on address length

The primaryAddr field holds one of the remote peer's addresses.
If the peer is NOT multi-homed, the primaryAddr holds the only address of the peer. The sockaddr_storage structure is found in [RFC2553].

state: 32 bits (signed integer)

This field holds one of a number of values that communicate the event that happened to the association. They include:

   Event Name          Value  Description
   ------------------  -----  ------------------------------------
   COMMUNICATION_UP      1    A new association is now ready and
                              data may be exchanged with this
                              peer.
   COMMUNICATION_LOST    2    The association identified by the
                              address has failed. The association
                              is now in the closed state.
   SHUTDOWN_COMPLETE     3    The association has gracefully
                              closed.
   CANT_START_ASSOC      4    The association failed to set up.

errorCode: 32 bits (signed integer)

If the state was reached due to an error condition (e.g. COMMUNICATION_LOST), any relevant error information is available in this field.

3.2.2.2 Interface details

When a destination address on a multi-homed peer encounters a change in reachability, an Interface details event is sent. The information carried in the msg_iov will hold the following structure:

   struct interfaceEvent {
       struct sockaddr_storage primaryAddr;
       struct sockaddr_storage affectedAddr;
       int state;
       int errorCode;
   };

primaryAddr: variable based on address length

The primaryAddr field holds the remote endpoint's address that was announced in the COMMUNICATION_UP notification. The sockaddr_storage structure is found in [RFC2553].

affectedAddr: variable based on address length

The affectedAddr field holds the remote peer's address within the association that is encountering the change of state. The sockaddr_storage structure is found in [RFC2553].

state: 32 bits (signed integer)

This field holds one of a number of values that communicate the event that happened to the address. They include:

   Event Name           Value  Description
   -------------------  -----  -----------------------------------
   ADDRESS_AVAILABLE      1    This address is now reachable.
   ADDRESS_UNREACHABLE    2    The address specified can no longer
                               be reached. Any data sent to this
                               address will be rerouted to an
                               alternate until this address is
                               considered reachable.

errorCode: 32 bits (signed integer)

If the state was reached due to an error condition (e.g. ADDRESS_UNREACHABLE), any relevant error information is available in this field.

3.2.2.3 SCTP Communication error

A remote peer may send an Operational Error message to its peer. This message will indicate a variety of error conditions on an association. This notification will be accompanied by the complete SCTP TLV in the msg_iov structure. Please refer to the SCTP specification [SCTP] section 3.3.10 for a complete list of possible error formats. In general the messages will have the format:

   struct OperationalError {
       unsigned short causeCode;
       unsigned short causeLength;
       unsigned char  causeInfo[];
   };

causeCode: 16 bits (unsigned integer)

This value represents one of the Operational Error causes defined in the SCTP specification.

causeLength: 16 bits (unsigned integer)

This value represents the length including the causeCode, causeLength and any additional information carried in causeInfo.

causeInfo: variable

This represents the detailed error information sent by the remote endpoint.

4. Datagram Interface

The datagram interface to SCTP attempts to emulate UDP more than it does TCP. It does this in a number of ways:

A) Support of implicit association setup.

B) Data is delivered as complete messages, with one notable exception (the partial delivery mechanism of section 3.1.2).

C) Automatic acceptance of new associations.

IOCTLs do exist to convert an SCTP endpoint into a TCP compatible socket.

A typical server in this model uses the following socket calls in sequence to prepare an endpoint for servicing requests:

   socket(), bind()

At this point new associations will be discovered by the server when an event is reported from the SCTP stack via recvmsg().

   recvmsg(), sendmsg()

It may call sendmsg() to terminate an association, passing in no user data but including the appropriate flag in the ancillary data.

A typical client uses the following calls in sequence to set up an association with a server to request services:

   socket(), sendmsg(), recvmsg()

It may call sendmsg() to terminate this association, passing in no user data but including the appropriate flag in the ancillary data.

A server or client may wish to branch an association off to its own socket. It may do this by calling accept(), specifying one of the addresses of an existing association. accept() will return a new socket which can then be used with the recv()/send() message calls.

4.1.1 socket()

Applications use socket() to create a socket descriptor to represent an SCTP endpoint.

The syntax is

   sd = socket(PF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);

Or

   sd = socket(PF_INET6, SOCK_SEQPACKET, IPPROTO_SCTP);

The first one creates an endpoint which can use only IPv4 addresses. The second one creates an endpoint which can use both IPv6 and IPv4 mapped addresses.

4.1.2 bind()

Applications use bind() to pass addresses associated with an SCTP endpoint to the system. Those addresses are the eligible transport addresses for sending and receiving data that the endpoint can present to its peers.

The syntax is

   ret = bind(int sd, struct sockaddr *addr, int addrlen);

sd is the socket descriptor created by socket(), addr is the address structure (struct sockaddr_in or struct sockaddr_in6 [RFC 2553]), and addrlen is the size of the address structure. The caller should use struct sockaddr_storage described in RFC 2553 to represent addr for portability reasons.

If sd is an IPv4 socket, the address has to be an IPv4 address. If sd is an IPv6 socket, the address can be an IPv4 or IPv6 address.

Applications can call bind() multiple times to associate multiple addresses with the endpoint.

If the IPv4 address specified is INADDR_ANY or the IPv6 address specified is in6addr_any, which is normally used by server applications, the system will associate the endpoint with all its interfaces.
Note that a wildcard address can be used only once in bind(). This means that it has to be used in the first bind() call, and the application cannot call bind() on that endpoint again.

After calling bind(), the SCTP endpoint will accept all SCTP INIT requests, passing the COMMUNICATION_UP notification to the endpoint upon reception of a valid association (i.e. receipt of a valid COOKIE ECHO).

4.1.3 sendmsg() and recvmsg()

Applications use sendmsg() and recvmsg() to transmit data to and receive data from their peers. These calls take the form:

   ssize_t sendmsg(int socket, const struct msghdr *message, int flags);
   ssize_t recvmsg(int socket, struct msghdr *message, int flags);

socket - is the socket descriptor of the endpoint,
message - is the msghdr structure containing the single message and any ancillary data,
flags - is the flags field sent/received with the message. See section 3 for a description of the valid flag values.

The sendmsg()/recvmsg() calls can be used to send and receive data. Along with this data, the ancillary data field in msg_control is allowed to carry the sctp_sndrcvinfo and/or the sctp_initmsg structures to specify various options for sending. When sending, the msg_name field is filled in with one of the addresses of an existing association OR the address of a new association to set up. Upon reception of data the msg_name field is populated with the source of the data.

Note: if the socket is a high bandwidth socket that represents only one association (see section 4.3), then the msg_name field is ignored upon sending data.

For this interface style an application SHOULD use the sendmsg() call to shut down an association, using the appropriate flags in the ancillary information.

4.1.4 close()

Applications use close() to close down an association or the main datagram socket.

The syntax is

   ret = close(int sd);

sd is the socket descriptor of the association to be closed.
For a high bandwidth socket (see 4.3), this will invoke the normal SCTP Shutdown primitive. If the user attempts to close() the initial SCTP datagram socket, all associations represented by the socket initiate the SCTP Shutdown primitive.

This is similar to the SHUTDOWN primitive described in [SCTP] section 10.1. The system will gracefully close down the association.

4.2 Sending and receiving data with implicit association setup

Once all binding is complete, the endpoint may begin sending and receiving data using the sendmsg() or recvmsg() calls. Any time a new address (i.e. one with which an association is NOT set up) is specified in the msg_name field of the msghdr structure passed to sendmsg(), an implicit association is begun with that endpoint. No connect() system call is required. Upon successful association setup a COMMUNICATION_UP notification will be dispatched to the socket and thus read by the recvmsg() system call. Note also that if the implementation supports bundling, the COOKIE ECHO message will be bundled with the data message sent to the remote endpoint.

When a new association is begun implicitly, the SCTP_msgcontrol structure will be consulted for any special options that the endpoint should be set up with. The init_parameters field (the sctp_initmsg structure of section 3.1.1) is used to pass this special information. When this information is not present, the default endpoint initialization parameters will be used. These may be set with the respective ioctl calls or left to the system defaults.

An endpoint may identify messages received from an association based on the msg_name field received in the recvmsg() call. Other information such as stream number and stream sequence number is also populated in the SCTP_msgcontrol structure. Likewise, when sending data messages, the send_recv_parameters structure (the sctp_sndrcvinfo structure of section 3.1.2) can be populated with specific stream number, TOS, flags, context and protocol Id (sinfo_ppid) values that will affect each specific send.

Use of other calls for sending and receiving (i.e.
sendto/recvfrom) will assume default parameters for everything that would normally be specified in the ancillary data passed to the sendmsg()/recvmsg() call. Use of the send()/recv() calls is restricted to high bandwidth associations only (section 4.3).

4.3 High bandwidth association

During the life of an endpoint an application may want a specific association to be branched out into a separate file descriptor. An application may wish to have a number of sporadic message senders grouped under the generic SCTP endpoint socket and specific high volume data associations each placed under their own file descriptor. To do this a datagram mode SCTP endpoint MAY use the accept() system call. Please note the semantics for this are somewhat changed from the traditional meaning. The following is the signature of the accept() system call:

   int accept(int socket, struct sockaddr *who, int *who_len)

socket: 32 bits (signed integer)

This is the main SCTP endpoint socket/file descriptor, i.e. the one initially opened by the socket() system call.

who: pointer

This is the specific address of the association that is to be pulled onto a separate file descriptor. In a traditional TCP call this would be an out parameter, but for the datagram based SCTP this is an IN parameter and specifies the association that is to be accepted into its own socket descriptor.

who_len: pointer

This field points to an integer holding the size of the sockaddr structure carried in the who field. In a traditional TCP call this would be an out parameter, but for the datagram based SCTP this is an IN parameter.

4.4 Closing - graceful and abortive

At some point in communication with a peer the upper layer may wish to close an association. To do this the sendmsg() call will be used. The sendmsg() call should be made with NO data and the ABORT or SHUTDOWN flag set in sinfo_flags.

4.5 An example session

[ To be filled in later ]

5. Connection oriented model

The goal of this model is to follow as closely as possible the current practice of using the sockets interface for a connection oriented protocol, such as TCP. Because of this, using certain socket calls, such as send() and recv(), may restrict the caller to a subset of the features SCTP provides. But with this model, existing applications using a connection oriented protocol can be ported to use SCTP with very little effort. In order to utilize most SCTP features, the new SCTP socket options, sendmsg() and recvmsg() have to be used.

The following is a simple example to illustrate how this model works.

A typical server in this model uses the following socket calls in sequence to prepare an endpoint for servicing requests:

   socket(), bind(), listen(), accept()

accept() blocks until a new association is set up. It returns with a new socket descriptor. The server then uses the new socket descriptor to communicate with the client. It may call the following in a loop until it has processed all requests.

   recv(), send()

Then it calls close() to terminate this association.

A typical client uses the following calls in sequence to set up an association with a server to request services:

   socket(), connect()

After connect() succeeds, it may call the following in a loop until it has sent and received all requests.

   send(), recv()

Then it calls close() to terminate this association.

5.1 Socket Interface

This section specifies the SCTP socket interface.

Editors Note: [Should we include the return codes of these calls in the draft, or should they be in the man pages of different OSes? We may want to map special error codes for SCTP. EMSGSIZE is used below. And if an app chooses not to receive events, we need to map some of those events to an error. We need to figure out the mapping.]

5.1.1 socket()

Applications use socket() to create a socket descriptor to represent an SCTP endpoint.

The syntax is

   sd = socket(PF_INET, SOCK_STREAM, IPPROTO_SCTP);

Or

   sd = socket(PF_INET6, SOCK_STREAM, IPPROTO_SCTP);

The first one creates an endpoint which can use only IPv4 addresses. The second one creates an endpoint which can use both IPv6 and IPv4 mapped addresses.

5.1.2 bind()

Applications use bind() to pass addresses associated with an SCTP endpoint to the system. Those addresses are the eligible transport addresses for sending and receiving data that the endpoint can present to its peers.

The syntax is

   ret = bind(int sd, struct sockaddr *addr, int addrlen);

sd is the socket descriptor created by socket(), addr is the address structure (struct sockaddr_in or struct sockaddr_in6 [RFC 2553]), and addrlen is the size of the address structure. The caller should use struct sockaddr_storage described in RFC 2553 to represent addr for portability reasons.

If sd is an IPv4 socket, the address has to be an IPv4 address. If sd is an IPv6 socket, the address can be an IPv4 or IPv6 address.

Applications can call bind() multiple times to associate multiple addresses with the endpoint.

If the IPv4 address specified is INADDR_ANY or the IPv6 address specified is in6addr_any, which is normally used by server applications, the system will associate the endpoint with all its interfaces. Note that a wildcard address can be used only once in bind(). This means that it has to be used in the first bind() call, and the application cannot call bind() on that endpoint again.

After calling bind(), the SCTP endpoint will accept all SCTP INIT requests. But it will promptly send an ABORT and discard data received until a listen(), described below, is performed on the socket.

5.1.3 listen()

Applications use listen() to inform the system that they are ready to accept SCTP associations.

The syntax is

   ret = listen(int sd, int backlog);

sd - is the socket descriptor of an SCTP endpoint, and
backlog - is the number of outstanding associations in the socket's accept queue.
These associations have already finished the four-way INIT handshake ([SCTP] section 5) and are in the ESTABLISHED SCTP state.

5.1.4 accept()

Applications use accept() to remove ESTABLISHED SCTP associations from the accept queue. A new socket descriptor is created to represent the new association.

The syntax is:

   new_sd = accept(int sd, struct sockaddr *addr, socklen_t *addrlen);

sd - is the listening socket descriptor,
addr - will contain the primary address of the peer endpoint when accept() returns,
addrlen - will store the size of the address returned,
new_sd - is the socket descriptor of the new association.

5.1.6 connect()

Applications use connect() to initiate an association with a peer.

The syntax is

   ret = connect(int sd, const struct sockaddr *addr, int addrlen);

sd is the socket descriptor created by socket(), addr is the peer's address, and addrlen is the size of the address.

This is similar to the ASSOCIATE primitive described in [SCTP] section 10.1.

By default, there is only one outbound stream. To change this, use the SCTP_INITMSG option described in section 6.

If bind() is not called prior to the connect() call, the system will pick an ephemeral port and one of its addresses as the primary address for the association. If an application wants to utilize the multi-homing feature of SCTP, it needs to call bind() before calling connect().

Editors Note: [Another semantics to consider is that, without bind(), the system will use one address from each interface to create the list of addresses for the association. This automatically enables the fault tolerant feature for existing applications ported to use SCTP without any special change.]

5.1.7 close()

Applications use close() to close down an association.

The syntax is

   ret = close(int sd);

sd is the socket descriptor of the association to be closed.

This is similar to the SHUTDOWN primitive described in [SCTP] section 10.1. The system will gracefully close down the association.
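The client side of the connection oriented model (socket(), connect(), send(), recv(), close()) can be sketched as below. This is a minimal sketch rather than a reference implementation: fill_ipv4_peer() and sctp_request() are hypothetical helper names, error handling is reduced to returning -1, and the code assumes a system that actually provides SCTP through the socket() call shown in section 5.1.1. On a system without SCTP support the socket() call is expected to fail and the helper returns -1.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef IPPROTO_SCTP
#define IPPROTO_SCTP 132            /* IANA protocol number for SCTP */
#endif

/* Hypothetical helper: fill in an IPv4 peer address from a dotted
 * literal. Returns 0 on success, -1 if the literal is not valid. */
int fill_ipv4_peer(struct sockaddr_in *sin, const char *ip,
                   unsigned short port)
{
    assert(sin != NULL && ip != NULL);
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}

/* Hypothetical helper: one request/reply exchange with a peer.
 * Returns the number of bytes received, or -1 on any failure. */
ssize_t sctp_request(const struct sockaddr_in *peer,
                     const void *req, size_t req_len,
                     void *reply, size_t reply_cap)
{
    ssize_t n = -1;
    int sd = socket(PF_INET, SOCK_STREAM, IPPROTO_SCTP);
    if (sd < 0)
        return -1;                  /* e.g. SCTP is not available */

    /* No bind(): the system picks the local port and address. */
    if (connect(sd, (const struct sockaddr *)peer, sizeof(*peer)) == 0 &&
        send(sd, req, req_len, 0) == (ssize_t)req_len)
        n = recv(sd, reply, reply_cap, 0);

    close(sd);                      /* graceful shutdown, per 5.1.7 */
    return n;
}
```

Because send() and recv() are used, the exchange is limited to the default stream and default parameters; an application needing other streams would use sendmsg() and recvmsg() instead.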

5.1.8 shutdown()

Applications use shutdown() to abort an association.

The syntax is

   ret = shutdown(int sd, int how);

sd is the socket descriptor; how has no meaning.

This is similar to the ABORT primitive described in [SCTP] section 10.1. The system will terminate the association and release the resources used by it. The value of how does not affect the behavior of the ABORT.

To pass a cause code ([SCTP] section 10.1), the caller should use sendmsg(), described below. Note also that a caller can use the MSG_ABORT flag in sendmsg() to abort an association without calling shutdown().

5.1.9 sendmsg() and recvmsg()

Applications use sendmsg() and recvmsg() to transmit data to and receive data from their peers. The semantics are similar to those of the Datagram model described in Section 4. There are several differences.

1. The msg_name field in the msghdr normally is not filled in. When sending, if the caller wants to send to a peer address other than the primary address, it can set the address in the msg_name field. When receiving, if a message is not received from the primary address, the system fills in the msg_name field so that the caller can retrieve the information.

2. The INIT type must not be used in msghdr. To change the default initialization behavior, the caller can use the SCTP_INITMSG socket option.

3. The caller must use close() to gracefully shut down an association. If the caller sets the ABORT or SHUTDOWN flag in sendmsg(), the system will return an error.

5.2 Examples

[To be filled in... ]

6. Calls and operations common to both models

6.1 send(), recv(), sendto(), recvfrom()

Applications can use send() and sendto() to transmit data to the peer of an SCTP endpoint. recv() and recvfrom() can be used to receive data from the peer.
The syntax is:

   size = send(int sd, const void *msg, size_t len, int flags);
   size = sendto(int sd, const void *msg, size_t len, int flags,
                 const struct sockaddr *to, int tolen);
   size = recv(int sd, void *buf, size_t len, int flags);
   size = recvfrom(int sd, void *buf, size_t len, int flags,
                   struct sockaddr *from, int *fromlen);

   sd      - is the socket descriptor of an SCTP endpoint,
   msg     - is the message to be sent,
   len     - is the size of the message or the size of the buffer,
   to      - is one of the peer addresses of the association to be used to send the message,
   tolen   - is the size of the address,
   buf     - is the buffer to store a received message,
   from    - is the buffer to store the peer address used to send the received message,
   fromlen - is the size of the from address buffer,
   flags   - is described below.

SCTP has the concept of multiple streams in one association. The above calls do not allow the caller to specify which stream a message should be sent on or received from. The system uses stream 0 as the default stream for the above calls. In all of the calls listed above, the socket descriptor must represent a single association.

SCTP is message based. The msg buffer in send() and sendto() is considered to be a single message. This means that if the caller wants to send a message that is composed of several buffers, the caller needs to combine them before calling send() or sendto(). Alternatively, the caller can use sendmsg() to send them without combining them.

When receiving, if the buffer supplied is not large enough to hold a complete message, the receive call will return an EMSGSIZE error. Refer to recvmsg() for a method to receive a partial message.

The flags parameter is formed by ORing one or more of the following:

   UNORDERED
      SCTP has a concept of unordered delivery. When sending, the caller can use this flag to tell the system that this message can be delivered unordered. The caller must set this flag in all calls that transmit unordered messages.
Note that these calls, when used in the datagram mode, may only be used with high bandwidth socket descriptors (see section 4.3).

6.2 setsockopt(), getsockopt()

Applications use setsockopt() and getsockopt() to set or retrieve socket options. Socket options are used to change the default behavior of sockets calls. They are described in section 6.4. The syntax is:

   ret = getsockopt(int sd, int level, int optname, void *optval, int *optlen);
   ret = setsockopt(int sd, int level, int optname, const void *optval, int optlen);

   sd      - is the socket descriptor,
   level   - is IPPROTO_SCTP for all SCTP options,
   optname - is the option name,
   optval  - is the buffer to store the value of the option,
   optlen  - is the size of the buffer.

6.3 read() and write()

Applications can use read() and write() to receive data from and send data to a peer. They have the same semantics as recv() and send(), except that the flags parameter cannot be used.

Note that these calls, when used in the datagram mode, may only be used with high bandwidth socket descriptors (see section 4.3).

6.4 Socket Options

The following sub-sections describe various socket options that are common to both models. All option parameters include a sockaddr_storage structure. For the datagram model this MUST be set to identify the association instance that the operation affects. For the connection-oriented model and high bandwidth datagram sockets (see section 4.3) this parameter is ignored.

6.4.1 Read / Write Options

6.4.1.1 Retransmission Timeout Parameters (SCTP_RTOINFO)

The protocol parameters used to initialize and bound the retransmission timeout (RTO) are tunable. See [SCTP] for more information on how these parameters are used in RTO calculation.
The following structure is used to access and modify these parameters:

   struct sctp_rtoinfo {
      struct sockaddr_storage srto_address;
      uint32_t srto_initial;
      uint32_t srto_max;
      uint32_t srto_min;
   };

   srto_address - is the identifying address as described in 6.4,
   srto_initial - contains the initial RTO value,
   srto_max and srto_min - contain the maximum and minimum bounds for all RTOs.

All parameters are time values, in milliseconds. A value of 0, when modifying the parameters, indicates that the current value should not be changed.

To access or modify these parameters, the application should call getsockopt() or setsockopt() respectively with the option name SCTP_RTOINFO.

6.4.1.2 Association Retransmission Parameter (SCTP_ASSOCRTXINFO)

The protocol parameter used to set the number of retransmissions sent before an association is considered unreachable is tunable. See [SCTP] for more information on how this parameter is used.

The following structure is used to access and modify this parameter:

   struct sctp_assocparams {
      struct sockaddr_storage sasoc_address;
      uint16_t sasoc_asocmaxrxt;
   };

   sasoc_address    - is the identifying address as described in 6.4,
   sasoc_asocmaxrxt - contains the maximum number of retransmission attempts to make for the association.

To access or modify this parameter, the application should call getsockopt() or setsockopt() respectively with the option name SCTP_ASSOCRTXINFO.

The maximum number of retransmissions before a path is considered unreachable is also tunable, but is path-specific, so it is covered in a separate option. If an application attempts to set the value of the association maximum retransmission parameter to more than the sum of all path maximum retransmission parameters, setsockopt() shall return an error.
The reason for this, from [SCTP] section 8.2:

   Note: When configuring the SCTP endpoint, the user should avoid having the value of 'Association.Max.Retrans' larger than the summation of the 'Path.Max.Retrans' of all the destination addresses for the remote endpoint. Otherwise, all the destination addresses may become inactive while the endpoint still considers the peer endpoint reachable.

6.4.1.3 Path Parameters (SCTP_PATHPARAMS)

Applications can enable or disable heartbeats for any path, modify a path's heartbeat interval, and adjust the maximum number of retransmissions sent before a path is considered unreachable. The following structure is used to access and modify a path's parameters:

   struct sctp_pathparams {
      struct sockaddr_storage spp_path;
      uint32_t spp_interval;
      uint16_t spp_pathmaxrxt;
   };

   spp_path       - specifies which path is of interest (for the datagram model this also identifies the association in question),
   spp_interval   - contains the value of the heartbeat interval, in milliseconds. A value of 0, when modifying the parameter, specifies that the heartbeat on this path should be disabled,
   spp_pathmaxrxt - contains the maximum number of retransmissions before this path shall be considered unreachable.

To access or modify these parameters, the application should call getsockopt() or setsockopt() respectively with the option name SCTP_PATHPARAMS.

6.4.1.4 Initialization Parameters (SCTP_INITMSG)

Applications can specify protocol parameters for the default association initialization. The structure used to access and modify these parameters is defined in section 3.1.1. The option name argument to setsockopt() and getsockopt() is SCTP_INITMSG.

Setting initialization parameters is effective only on an unconnected socket (for the datagram model, only future associations are affected by the change).
6.4.2 Read-Only Options

6.4.2.1 Path Information (SCTP_PATHINFO)

Applications can retrieve information about a path, including its reachability state, congestion window, and retransmission timer values. This information is read-only, so only getsockopt() operates on this option. Calls to setsockopt() on this option will return an error. The following structure is used to access this information:

   struct sctp_pathinfo {
      struct sockaddr_storage spath_path;
      int spath_state;
      uint32_t spath_cwnd;
      uint32_t spath_srtt;
      uint32_t spath_rto;
   };

spath_path is filled in by the application, and contains the path of interest (for the datagram model this also identifies the association in question). On return from getsockopt(), spath_state will contain the path's state (either SCTP_ACTIVE or SCTP_INACTIVE), spath_cwnd the path's current congestion window, spath_srtt the path's current smoothed round-trip time calculation, and spath_rto the path's current retransmission timeout value. spath_srtt and spath_rto are in milliseconds.

To retrieve this information, use getsockopt() with the option name set to SCTP_PATHINFO.

6.4.2.2 Peer Endpoint's Set of Addresses (SCTP_PATHCOUNT, SCTP_ALLPATHS)

Applications can retrieve the set of addresses that correspond to a peer endpoint. Since this set is variable length, two options are needed to retrieve the information. The first, SCTP_PATHCOUNT, takes the following structure as its argument to getsockopt():

   struct sctp_pathcnt {
      struct sockaddr_storage spthc_address;
      uint32_t spthc_numpaths;
   };

spthc_address is the identifying address as described in 6.4. spthc_numpaths is filled in upon return from this call and indicates the number of addresses associated with the peer.

The application can then allocate a buffer large enough to hold all the peer's addresses, and call getsockopt() with SCTP_ALLPATHS. For the datagram model, the first address in the call to SCTP_ALLPATHS MUST be filled in with a valid address that identifies the association.
On return, each address is represented as a struct sockaddr_storage, so if n contains the number of peer addresses, the caller must allocate a buffer of size n * sizeof (struct sockaddr_storage).

The application can retrieve information on each path by enumerating through the returned list of addresses and calling getsockopt() with the SCTP_PATHINFO option name. This information is read-only.

6.4.2.3 Association Status (SCTP_STATUS)

Applications can retrieve current status information about an association, including association state, peer receiver window size, number of unacked data chunks, and number of data chunks pending receipt. This information is read-only. The following structure is used to access this information:

   struct sctp_status {
      struct sockaddr_storage sstat_address;
      int sstat_state;
      uint32_t sstat_rwnd;
      uint16_t sstat_unackdata;
      uint16_t sstat_penddata;
      struct sctp_pathinfo sstat_primary;
   };

   sstat_address   - is the identifying address as described in 6.4,
   sstat_state     - contains the association's current state (states TBD),
   sstat_rwnd      - contains the association peer's current receiver window size,
   sstat_unackdata - is the number of unacked data chunks,
   sstat_penddata  - is the number of data chunks pending receipt,
   sstat_primary   - is information on the current primary path.

To access these status values, the application calls getsockopt() with the option name SCTP_STATUS.

6.4.3 Ancillary Data Interest Options

Applications can receive notifications of certain SCTP events and per-message information as ancillary data with recvmsg().

The following optional information is available to the application:

   1. SCTP_RECVDATAIOEVNT: Per-message information (e.g., stream number, TSN, SSN; described in section 3.2.2)
   2. SCTP_RECVASSOCEVNT: (described in section 3.2.2)
   3. SCTP_RECVPATHEVNT: (described in section 3.2.2)
   4. SCTP_RECVSENDFAILEVNT: (described in section 3.2.2)
   5. SCTP_RECVPEERERR: (described in section 3.2.2)

To receive any ancillary data, the application first registers its interest by calling setsockopt() to turn on the corresponding flag:

   int on = 1;

   setsockopt(fd, IPPROTO_SCTP, SCTP_RECVDATAIOEVNT, &on, sizeof(on));
   setsockopt(fd, IPPROTO_SCTP, SCTP_RECVASSOCEVNT, &on, sizeof(on));
   setsockopt(fd, IPPROTO_SCTP, SCTP_RECVPATHEVNT, &on, sizeof(on));
   setsockopt(fd, IPPROTO_SCTP, SCTP_RECVSENDFAILEVNT, &on, sizeof(on));
   setsockopt(fd, IPPROTO_SCTP, SCTP_RECVPEERERR, &on, sizeof(on));

Note that for connectionless mode SCTP sockets, the caller of recvmsg() will receive ancillary data for ALL associations bound to the file descriptor in use. For connection-oriented SCTP sockets, the caller will receive ancillary data for only the single association bound to the file descriptor.

By default the connection-oriented socket has all options off. By default the datagram-oriented socket has SCTP_RECVDATAIOEVNT and SCTP_RECVASSOCEVNT on and all other options off.

The format of the data structures for each ancillary data item is given in section 3.2.2.

6.5 Helper Functions

[To be filled in. These are the functions that help application programmers deal with msghdr and cmsghdr.]

7. Security

To be filled in later.

8. Authors' Addresses

   Randall R. Stewart                      Tel: +1-815-479-8536
   Cisco Systems, Inc.                     EMail: rstewart@flashcom.net
   Crystal Lake, IL 60012
   USA

   Qiaobing Xie                            Tel: +1-847-632-3028
   Motorola, Inc.                          EMail: qxie1@email.mot.com
   1501 W. Shure Drive, #2309
   Arlington Heights, IL 60004
   USA

   La Monte Yarroll                        Tel: +1-847-632-xxxx
   Motorola, Inc.                          EMail: piggy@cig.mot.com
   1501 W. Shure Drive, #2309
   Arlington Heights, IL 60004
   USA

   Jonathan Wood
   Sun Microsystems, Inc.                  EMail: jonathan.wood@eng.sun.com
   901 San Antonio Road
   Palo Alto, CA 94303
   USA

   Kacheong Poon
   Sun Microsystems, Inc.                  EMail: kacheong.poon@eng.sun.com
   901 San Antonio Road
   Palo Alto, CA 94303
   USA

9. References

   [RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", RFC 2026, October 1996.

   [SCTP] Stewart, R. R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H. J., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, "Stream Control Transmission Protocol", work in progress, July 2000.

   [STEVENS] Stevens, W. R., Thomas, M., and E. Nordmark, "Advanced Sockets API for IPv6", work in progress, December 1999.

   [RFC2553] Gilligan, R., Thomson, S., Bound, J., and W. Stevens, "Basic Socket Interface Extensions for IPv6", RFC 2553, March 1999.