IEN 167 Sax and Edmond Bolt Beranek and Newman Inc. July 1980

HP3000 TCP DESIGN DOCUMENT

Jack Sax and Winston Edmond
Bolt Beranek and Newman Inc.
July 1980

Table of Contents

1 Preface
2 Introduction
3 Current HP3000 Structure
  3.1 Processor Features
  3.2 Network Interface Hardware
  3.3 Operating System Software
  3.4 Input/Output
  3.5 Interprocess Communication
  3.6 Existing INP Software
4 Protocol Software Architecture
5 System Protocol Software
  5.1 Implemented Features
  5.2 Software Architecture Overview
  5.3 Control Structures
    5.3.1 Network Resources Control Block
    5.3.2 Foreign Host Control Blocks
    5.3.3 Connection Control Block
    5.3.4 Network Buffer Resources List Structures
6 User Process/TCP Interface
  6.1 Interface Intrinsics
  6.2 Flow Control Across the Interface
  6.3 Interface Control Structures
  6.4 Interface Control Algorithms
  6.5 Windowing, Acknowledgment, and Retransmission
7 1822 Layer/INP Driver Communication
8 Protocol Software Buffering Scheme
  8.1 Network Buffer Pool
    8.1.1 Packet Compaction
    8.1.2 Buffer Recycling
  8.2 User Process Buffer Pool
9 Data Flow Through the Protocol Software
  9.1 ARPANET to the User Level Data Flow
  9.2 User Level to the ARPANET Data Flow

1 Preface

This report describes a design for the implementation of ARPANET protocols on a Hewlett Packard HP3000 Series III computer system. Specific protocols to be implemented include a Transmission Control Protocol (TCP), Internet Protocol (IP), File Transfer Protocol (FTP), and TELNET Protocols. The reader is assumed to be familiar with the purpose of these protocols. The protocol software will run under the HP Multiprogramming Executive (MPE) operating system.

The designs reflect our current understanding of the environment and the tasks ahead and may be changed as we proceed with implementation.

2 Introduction

The overall purpose of this project is to modify the Hewlett Packard 3000/Series III computer system running the MPE operating system so that it can converse with the ARPANET.

A layered protocol approach will be used in our implementation. Protocol layers one through four represent the system layers which are responsible for moving a message reliably from one Host to another. The next protocol layer consists of a number of applications protocols which determine the content and meaning of the messages exchanged.

Protocol layers one and two are X.25 LAP link access protocols. These protocols are implemented in microcode on the Intelligent Network Processor (INP) interface available from Hewlett Packard. Since the X.25 LAP protocols are different from the standard 1822 IMP Host protocols, a special X.25 IMP interface is used to link the HP3000 with the ARPANET. The interface divides standard 1822 packets into a number of X.25 frames and transfers each frame separately.
The diagram in Appendix A shows the hardware configuration used to link the HP3000 to the ARPANET.

The next two protocol layers consist of the DOD standard Internet Protocol (IP) and the Transmission Control Protocol (TCP). The Internet protocol provides communication between Hosts on different networks via gateways between the networks. The Transmission Control Protocol provides reliable transmission between Hosts and performs some Host-to-Host flow control.

The initial implementation will include three application layer protocols. One of these is the File Transfer Protocol (FTP), which allows a user to move files from one computer system to another. The second and third application layer protocols are User and Server TELNET. User TELNET gives the user a remote terminal capability by taking the characters from the local input device and sending them to the foreign host, and returning characters from the foreign host to the local output device (typically a terminal). The foreign host will have a Server TELNET process which acts as a pseudo-Teletype, with incoming network messages providing TTY input, and TTY output being sent to the network. The operating system treats the Server TELNET pseudo-Teletype like an ordinary terminal.

Most of the protocol software is new code, the major exception being the INP microcode which is supplied by Hewlett Packard. The programs will be written in HP's Systems Programming Language (SPL), which resembles PASCAL and allows intermixing of assembly code and compiled code. In addition to new code, implementation will require changes to the MPE operating system code, which is also written in SPL.

3 Current HP3000 Structure

This section describes the HP3000 system with an emphasis on the features that affect the network software design.
The description includes both the processor hardware and the operating system. Some of the operating system features described are not currently released by Hewlett Packard, but are about to be released or are part of the planned MPE IV release this fall.

3.1 Processor Features

The HP3000 CPU is a medium speed machine which uses a stack architecture. It executes uncomplicated instructions in one to two microseconds. Code and data are separate and thus all code is re-entrant. There are approximately 38 hardware registers which make up the state of the processor, most of which are associated with the stack (data) and the current instruction address (code).

Memory is divided into segments. A segment is a contiguous block of memory of any desired length up to 32K words. Individual segments are swapped in and out of memory as needed. Memory paging, a scheme which uses fixed size memory chunks as the basis for memory swapping, is not used in the HP3000. A segment may be designated as code or data by the operating system.

3.2 Network Interface Hardware

The interface unit between the HP3000 computer and the ARPANET machines will be HP's Intelligent Network Processor (INP). This device consists of two boards located in the HP3000 main cabinet. It is a microprogrammed interface unit whose microcode is down-line loaded by HP3000 software. HP will supply the microcode to make the INP obey the X.25 LAP protocol and will supply the device driver necessary to access the INP.

The INP will be connected to a BBN C/30 (MBB) computer. This machine will convert the X.25 protocols from the INP into suitable ARPANET protocols.

3.3 Operating System Software

The operating system for the HP3000 is known as the Multiprogramming Executive System (MPE). It offers both batch and interactive job capabilities and allows multiple concurrent users of either type.
It offers a file system which manages files on disk, magnetic tape and/or punched cards. Some I/O devices, such as the line printer, have spooler programs built in to the system.

User programs are run as processes within MPE. Each process has associated with it a code segment and a stack (data) segment. In privileged mode, a process may run in "split-stack mode", where it is allowed to have two data segments. The most common use of split-stack mode is to access tables in the operating system during system calls.

The design of MPE is greatly influenced by the HP3000 hardware architecture. MPE's organization relies heavily on operations which incur little processor overhead while avoiding operations which incur large amounts of processor overhead. The most striking example of this is MPE's dependence on user processes for a large number of what would ordinarily be considered system functions. MPE avoids the use of "system" processes to perform these functions.

This design organization is a direct result of the stack architecture of the HP3000. The large number of status registers which must be saved when a new process is invoked makes process switching a very expensive operation. The time needed to perform a procedure call into a new segment of system code is typically less than the time to switch context from one process to another.

Writing efficient code for this machine has thus led to organizing the system as relatively independent "utility" routines callable by the user rather than as a collection of separate processes which manage I/O devices and system utilities. These operating system calls, called Intrinsics, are implemented as subroutine calls into system code segments. The program segments which implement the Intrinsics run in a privileged mode which allows them to directly access system tables and I/O device tables.
One notable side-effect of this design is that system resources such as I/O devices are assigned to only a single program and are not normally shared. This approach has allowed the system programmers to create a complex operating system without tackling the problems of interprocess communication and resource sharing. As will be discussed later, it also has a significant effect on protocol software design.

3.4 Input/Output

Input/Output operations typically consist of two steps. The first step is initiation of the desired operation. This involves checking to ensure that access to the device is allowed (software protection), and issuing I/O instructions to the device to initiate the desired action. This step usually occurs as a result of an intrinsic call to the device handler code and thus is executed on the user's stack. The second step is the operation completion handling. This may occur using either the Interrupt Control Stack (ICS) or the System Control Stack, neither of which is the user's stack. The choice of which stack to use depends on the specific device's function.

A consequence of this system design is that "system code" tends to be executed using the data stack of the first user process needing the function. If process 1 wants to do an I/O operation, it invokes a system procedure which knows how to manage that I/O device. If process 2 now wishes to invoke the same device, and if the device is capable of supporting more than one request concurrently, it invokes the same routine. To avoid multiprocessing hazards in issuing I/O commands, the system procedure first checks to see if it is the first invocation of itself -- if not, it queues the request and exits; if it is, it proceeds to issue the I/O instructions. If the request was queued, it is assumed that the first process will detect the newly queued request and process it also.
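The first-invocation check described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: the class `DeviceHandler`, its `request_io` method, and the `issue` hook are hypothetical names chosen for illustration, not MPE procedures, and a real handler would need an atomic test of the "busy" flag rather than the plain boolean used here.

```python
from collections import deque


class DeviceHandler:
    """Toy model of a shared device procedure run on the caller's stack.

    All names here are illustrative assumptions, not MPE interfaces.
    """

    def __init__(self):
        self.pending = deque()  # requests waiting to be issued
        self.busy = False       # True while some caller is issuing I/O
        self.log = []           # (servicing process, request) pairs

    def request_io(self, process, request):
        """Queue a request; only the first invocation issues the I/O.

        Returns True if this caller serviced the queue, False if the
        request was left for the caller that is already active.
        """
        self.pending.append(request)
        if self.busy:
            return False        # not the first invocation: queue and exit
        self.busy = True
        try:
            # The first caller drains the queue, including requests
            # that later callers add while it is still working.
            while self.pending:
                req = self.pending.popleft()
                self.log.append((process, req))
                self.issue(req)
        finally:
            self.busy = False
        return True

    def issue(self, request):
        """Stand-in for issuing the actual I/O instructions."""
```

In this model, every request queued while the first caller is active is recorded against that caller, which mirrors the run-time accounting effect noted in the text.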
The first process is thus performing system functions for the second, and all later, processes, and will be charged run time for doing their work. In practice, we do not expect this to be significant, but in theory, the first process could run indefinitely, even if its own request has long since completed.

3.5 Interprocess Communication

Interprocess communication under the current version of MPE is a problem. Only two techniques are currently available and neither of them is really satisfactory.

One technique that may be used is that of the logical device. It is chiefly used to accomplish multiplexing of physical devices. This facility is implemented by creating a new entry in the system's Device Information Table, and by creating a set of procedures which act as a device handler. The handler will be run in privileged mode. Like other system device handlers, the procedures to manage the device are invoked directly by the user process, and the user's stack is used by the system code. This has the advantage of speed, since it avoids some process context switching.

There are a number of drawbacks to this technique. First, the Device Information Table entry must be maintained as if it were a real hardware device. This requires knowledge of all the MPE internal functions that might access this table. Furthermore, since these tables are system internal, they are subject to change with each new release of MPE. Use of the table requires Privileged Mode. Bugs in the code would have a greater chance of crashing the system. The greatest drawback is that logical devices are still under development at HP, and are more than usually likely to change over time.

A new operating system feature, not yet released officially, that has been written for MPE is an interprocess communication method known at HP as message files.
These correspond to Unix ports, and allow unrelated processes to communicate with one another. Each message file has one or more "reader" processes and one or more "writer" processes. During use, these files act as FIFO queues.

Message files are implemented using the file system. Read, write, and query commands are all patterned after the file system commands. The message file code is designed so that if readers and writers stay more or less in synchronization, disk I/O will not be needed. However, if the writers get far enough ahead of the readers, the message file will start being spooled out onto disk.

Message files are to be introduced as user level functions by HP, and, as such, their use will not change with new releases of the operating system. Code for this feature has already been implemented at HP and is available with both MPE III and the future MPE IV. Message files appear to be relatively easy to use and do not require knowledge of the internals of the operating system. Their chief drawbacks are that a process context switch is required between writer and reader, and that some file system overhead is incurred.

Timeouts, as seen in message files, are another new HP function that will be available. The older version of timeouts simply suspended the process for a fixed amount of time, but did not allow the process to be awakened by the completion of an I/O event during its sleep. The new version is equivalent to setting a timer whose alarm may be awaited with the same IOWAIT intrinsic that awaits I/O completion. It allows a process to wait for either some I/O device operation completion or the passage of some maximum amount of time, whichever occurs first. Alternatively, a timeout could be used to ensure that waiting for a specific event will terminate if the expected event does not occur soon enough. There will be both user level and system internal ways of accomplishing timeouts.
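The "I/O completion or timeout, whichever occurs first" behavior can be sketched as follows. This is a minimal illustration, not the MPE IOWAIT calling sequence: the function name `wait_for_event` is hypothetical, and a standard Python `queue.Queue` of file numbers stands in for the stream of I/O completion events.

```python
import queue


def wait_for_event(completions, timeout_seconds):
    """Wait for the next I/O completion or for the timer to expire.

    `completions` is a queue.Queue of file numbers (an assumption made
    for this sketch).  Returns the file number of the completed I/O,
    or None if the timeout elapsed first.
    """
    try:
        return completions.get(timeout=timeout_seconds)
    except queue.Empty:
        return None  # the alarm went off before any I/O completed
```

A caller would treat a `None` result as the timer alarm and any other value as the file number whose I/O completed.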
3.6 Existing INP Software

The code to drive the INP is part of the CS/3000 Communications Software package from HP. It contains code to send and receive packets via the INP and code to manipulate the Device Information Tables. The code also allows the user to down-line load microcode into the INP memory. It contains intrinsics to open and close the line and to read and write packets.

The microcode allows the INP to support X.25 LAP protocols and also allows the INP to buffer up to eight 128-byte packets. These packets are read by CS/3000 as soon as possible in order to keep the INP from losing packets due to a lack of buffer space in the INP. This technique allows the INP to function as a full duplex device, even though the MPE operating system offers only a half duplex control mechanism in its software.

4 Protocol Software Architecture

The protocol software architecture is dictated by a set of design requirements and MPE operating system constraints. These requirements and constraints are summarized as follows:

- The new network software must be isolated from the existing operating system as much as possible. The isolation will allow any site to add or remove the network software with a minimum of effort. It will also make the network software less vulnerable to any changes HP makes to MPE.

- Efficient high speed network communications are extremely important because this TCP version will be used on a production rather than an experimental basis.

- One of the problems with MPE is that, though the operating system performs device assignment and access control for its I/O devices, the user process is responsible for operating the I/O device. MPE does offer intrinsics to operate common devices, but these are very low level operations. This I/O arrangement makes it difficult to control an asynchronous network interface.
  The protocol software architecture will therefore require at least one process which has exclusive control of the INP interface.

- One of the properties of these network protocols is that the message acknowledgments and retransmissions occur at a relatively high level -- in the Transmission Control Protocol in layer four. A moderate amount of time passes between the time the originating TCP queues the message for transmission and the time the receiving TCP gets the message. In order to prevent acknowledgment delays which in turn cause the foreign host to retransmit data, the software architecture should minimize the amount of time it takes for incoming data to move through the 1822, IP, and TCP protocols.

- With many network users and many connections concurrently in use, the network software must be able to handle the problems of multiplexing use of the network interface hardware. The interface on which the multiplexing takes place must support a number of simultaneous users in such a way that the behavior of any individual user does not affect data throughput of the other users.

In order to meet all of the design requirements and constraints described above, the HP3000 protocol software is implemented in a set of processes (see diagram in Appendix B). One process, which will be called the system protocol process, is responsible for maintaining the INP interface as well as supporting the 1822, IP and TCP protocols. The rest of the processes, called applications protocol processes, support the user interactive network functions including FTP and TELNET.

The use of a single system protocol process is a key element in the protocol design. The system protocol process provides control over the INP interface by providing buffers and acting as multiplexer and de-multiplexer of network traffic to and from the INP.
Use of a single process minimizes inter-protocol layer communication delays, which in turn minimizes the acknowledgment delays for incoming data. A single system protocol process makes it possible to use interprocess communication primitives to provide a uniform network interface for the applications level protocol processes.

User TELNET and User FTP protocols are to be implemented as ordinary user programs. They use the same system calls as any other network accessing program, but are written to provide a higher level command language for the user. As user programs, they execute in the user's address space with the privileges normally available to the user. The User TELNET and User FTP programs are re-entrant, with as many processes running this code as users wishing the service.

Server TELNET is a single process created as the system starts up or whenever the first need for it arises. De-multiplexing of Server TELNET inputs is accomplished via a pseudo-teletype driver. The driver acts as the interface between the Server TELNET process and the Teletype handler.

The interface between application protocol processes and the system protocol process is through a set of TCP intrinsics. The intrinsics are designed to form a uniform interface between the user and the TCP. Actual data communication between a user process and the system protocol process is done with a combination of message files and direct buffer-to-buffer transfers. Message files are used to pass flow control information while the actual data transfer is made by copying data between user buffers and system protocol buffers. The combination of message files and buffer copy is used to take advantage of the flexibility of message files and the data rates achieved by direct data copy.

5 System Protocol Software

Since this TCP implementation is to be used on a production rather than an experimental basis, the design effort has concentrated on the efficiency rather than the sophistication of the protocol software. This is especially true of the system protocol software, whose initial design includes only those features needed to support the FTP and TELNET protocols. At the same time, the software design does allow for the future enhancement of the protocol software. There are no inherent design limitations which will prevent implementation of the more sophisticated TCP and Internet features.

5.1 Implemented Features

The specific TCP and Internet features to be implemented include the following:

- multiple connections to multiple hosts,
- flow control at the 1822, Internet, and TCP layers,
- error recovery,
- fair allocation of resources among all connections,
- handling of urgent data,
- surviving incorrect packets,
- datagram reassembly,
- routing,
- source quenching.

5.2 Software Architecture Overview

The system protocol software architecture reflects the need to avoid packet processing delays rather than a strict hierarchy between protocol layers. The system protocol software is implemented as a single process to allow the system protocol layers to share software resources for greater efficiency. The shared resources include subroutines which perform functions required by more than one protocol layer and a common buffer pool to optimize storage resources and to allow efficient communication between protocol layers.

Network traffic through the system protocol process takes different forms including 1822 packets, datagrams, and TCP segments. These various forms are generically referred to as "packets".
Packets are passed into the system protocol process from either an applications protocol process or the ARPANET interface.

Packets from the ARPANET are passed into the system protocol process by intrinsic calls to the INP interface. User generated network packets are passed to the system protocol process by using a combination of message files and data buffers. Message files are used to transfer control and status information while data transfer is done with buffer-to-buffer copies between the user protocol data segment and the system protocol data segment.

All read and write commands are done without wait to allow the system protocol process to simultaneously multiplex I/O channels and process network packets. I/O multiplexing is implemented through the IOWAIT intrinsic. The system protocol process issues an IOWAIT intrinsic after it finishes processing a data packet. The IOWAIT intrinsic returns the file number of the I/O channel associated with an I/O completion wakeup.

When the number of free buffers falls below a prescribed limit, an attempt is made to free buffers through data compaction. The attempt begins with a search for datagram fragments and unacknowledged TCP segments which waste buffer space by using only a fraction of the available space in each buffer assigned to them. This lack of efficiency can be particularly damaging because there is no guarantee that the data contained in the buffers will ever be processed. Wherever possible, datagram fragments are combined into a single datagram fragment and TCP segments are combined into a single segment to more efficiently utilize system buffers. Any buffers freed by this compaction process are returned to the freelist.

Network packets from both the user process and the ARPANET are processed along one of a number of data paths in the system protocol process.
The actual data path taken depends on the type of data packet and, in the case of TCP segments, the state of its associated network connection. Packet processing is performed by a series of function calls which act as processing steps along the data path.

In order to avoid processing delays which can tie up system resources, each arriving data packet is processed through as much of the protocol software as possible. Processing of a packet is suspended only when the lack of some resource or some external event prevents further processing.

5.3 Control Structures

All of the status information both for individual network connections and for the system protocol software as a whole is kept in a set of control blocks as well as in a number of buffer list structures as shown in Appendix C. The control blocks include a general network resources control block, a foreign host control block for each foreign host connected to the local host, and send and receive control blocks for each network connection. The list structures include a network buffer free list, a TCP buffer aging list and an Internet buffer aging list.

5.3.1 Network Resources Control Block

The Network Resources Control Block contains the information needed to maintain the network buffer free lists and aging lists. This information includes pointers to the network buffer free lists and aging lists and a count of the buffers in each of the lists.

The information contained in the Network Resources Control Block is used by the protocol software to control the distribution of network buffers among the various lists. The information is scanned at various times to determine the allocation or disposition of a particular network buffer. The determinations occur when new buffers are allocated from the free list and when buffers containing TCP segments are about to be acknowledged.
Decisions are made based on the number of free buffers available and the priority of the task requiring the buffers.

5.3.2 Foreign Host Control Blocks

Foreign Host Control Blocks maintain flow control within the 1822 protocol layer. Each block contains a counter for the number of outstanding 1822 packets sent to a single host. The counter includes all of the packets sent to the host on all sockets. The counter is incremented when an 1822 packet is sent and is decremented when either a RFNM or an Incomplete Transmission is received from the host. The counter is used to prevent transmission of too many 1822 packets to a single host. All transmission to the host is blocked when the counter reaches the limit of eight outstanding 1822 packets for any foreign host.

The 1822 level flow control is actually implemented by the send side of the TCP software. The TCP checks the RFNM count in the connection control block before it tries to transmit a segment to the foreign host.

5.3.3 Connection Control Block

Each TCP connection has an associated control block. The control block contains data associated with the Transmission Control Block (TCB) along with other connection related information. Specific information included in the control block is as follows:

- a connection state variable used to maintain the connection state,
- the local port number of the connection,
- the TCP interface control block number associated with this connection,
- the file number of the private message file associated with this connection,
- the TCB data associated with the receive side of this connection,
- the TCB data associated with the send side of this connection,
- a pointer to any buffers containing unacknowledged data received on this connection.
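The control blocks of Sections 5.3.2 and 5.3.3 can be pictured with the following sketch. It is illustrative only: the class names, field names, field types, and method names are assumptions introduced here, and only the limit of eight outstanding 1822 packets and the list of connection fields come from the text above.

```python
from dataclasses import dataclass, field
from typing import List

MAX_OUTSTANDING_1822 = 8  # limit stated in Section 5.3.2


@dataclass
class ForeignHostControlBlock:
    """Counts 1822 packets sent to one host and not yet answered."""
    outstanding: int = 0

    def may_send(self) -> bool:
        # Transmission to the host is blocked once the limit is reached.
        return self.outstanding < MAX_OUTSTANDING_1822

    def packet_sent(self) -> None:
        if not self.may_send():
            raise RuntimeError("too many outstanding 1822 packets")
        self.outstanding += 1

    def reply_received(self) -> None:
        """Call when a RFNM or an Incomplete Transmission arrives."""
        if self.outstanding > 0:
            self.outstanding -= 1


@dataclass
class ConnectionControlBlock:
    """One per TCP connection; fields follow the list in Section 5.3.3."""
    state: str                   # connection state variable
    local_port: int              # local port number
    interface_block_number: int  # TCP interface control block number
    message_file_number: int     # private message file for the connection
    receive_tcb: dict = field(default_factory=dict)  # receive-side TCB data
    send_tcb: dict = field(default_factory=dict)     # send-side TCB data
    unacked_buffers: List[bytes] = field(default_factory=list)
```

In this sketch the send side would call `may_send()` before transmitting a segment, which corresponds to the check of the RFNM count described in Section 5.3.2.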
5.3.4 Network Buffer Resources List Structures

Three list structures are used to maintain the network buffer resources shared by all of the sockets. These list structures include the free list and the two buffer aging queues.

The network buffer free list contains all of the network buffers currently available for use by any socket. These buffers are allocated when new data comes in from either the network or a user protocol process.

The Internet Aging Queue is a list of active buffers assigned to blocked datagram fragments and complete datagrams. These buffers are the first to be reclaimed when there are no free buffers available. The Queue is sorted according to datagram age. All buffers which belong to the same datagram are combined into a single list structure. The datagram list structures are linked into the Internet Aging Queue with the least recently updated datagram always at the head of the queue. When a new datagram fragment comes in it is moved to the end of the queue along with all of the other fragments which belong to the same datagram.

The TCP Aging Queue is a list of buffers which contain at least parts of unacknowledged TCP segments. These buffers can be reclaimed when there are no free buffers and no buffers on the Internet aging list. The Queue is sorted by socket. All buffers which contain data for the same socket are combined in a buffer list and each buffer list is linked into the queue. The queue is sorted by the age of the data associated with each socket. Data belonging to the socket which has been inactive for the longest period of time is placed at the head of the queue so it can be recycled first. When a user process reads data from a connection, all the network buffers still waiting to be read on that connection are moved to the end of the TCP aging list.
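The ordering rule shared by the two aging queues can be sketched with an ordered mapping from an owner (a datagram or a socket) to its list of buffers. The class and method names below are hypothetical. Treating the arrival of new data as activity that moves the owner to the end is stated in the text for the Internet Aging Queue; applying the same rule to the TCP Aging Queue is an assumption of this sketch.

```python
from collections import OrderedDict


class AgingQueue:
    """Buffer lists ordered from least to most recently active owner."""

    def __init__(self):
        self._lists = OrderedDict()  # owner -> list of buffers

    def add(self, owner, buffer):
        """Attach a buffer to its owner and move the owner to the end."""
        self._lists.setdefault(owner, []).append(buffer)
        self._lists.move_to_end(owner)

    def touch(self, owner):
        """Record activity (e.g. a user read) without adding a buffer."""
        if owner in self._lists:
            self._lists.move_to_end(owner)

    def reclaim_oldest(self):
        """Remove and return (owner, buffers) from the head of the queue.

        Returns None when the queue is empty.
        """
        if not self._lists:
            return None
        return self._lists.popitem(last=False)
```

With this structure, reclaiming buffers always takes them from the owner that has been inactive the longest, and a read on a connection (`touch`) protects that connection's buffers from being recycled ahead of idle ones.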
This assures that data associated with an active TCP connection will not be recycled ahead of data associated with an inactive TCP connection.

6 User Process/TCP Interface

The user process/TCP interface is designed to meet two basic requirements. First, the interface must allow for high speed data transmission across the interface; this is especially important since this interface involves interprocess communication which could be delayed by excessive system overhead due to context switching and process scheduling. Second, the interface must isolate the system protocol process from any buffer overhead burdens caused by processing delays in the user process. System protocol process buffers are too valuable a resource to be locked into storing TCP segments which are waiting for response from a user process.

High speed data transmission across the user process/TCP interface is achieved by copying data directly from buffers in the user process to buffers in the system protocol process. The direct transfer is implemented with the move-to-data-segment and move-from-data-segment instructions provided by the HP3000.

The system protocol process is isolated from delays in the user process by making the user process responsible for buffering TCP data segments. Acknowledged incoming TCP segments, and TCP segments waiting to be transmitted over the ARPANET, are stored in buffers in the user protocol process. This use of user buffers serves two functions: it frees system protocol buffers from being locked into storing TCP segments, and also gives the user process some control of network connection throughput. Throughput control is accomplished by allowing each user process to choose the amount of buffer resources it dedicates to each connection.

6.1 Interface Intrinsics

The TCP/user interface is implemented through a set of TCP intrinsics.
These intrinsics allow the user process to create and use network connections with other processes on foreign hosts. Seven intrinsics, TCPWAIT, TCPOPEN, TCPCLOSE, TCPRECEIVE, TCPSEND, TCPSTAT, and TCPABORT, provide the basic control functions needed to transfer data through the user process/TCP interface.

Conceptually, the intrinsics allow the user to create network connections with other processes on foreign hosts. Each connection consists of a pair of sockets as defined in the TCP protocol document. Connections are created with a TCPOPEN intrinsic whose parameters define the sockets which make up the connection. After a connection is created, the user process uses the TCPSEND and TCPRECEIVE intrinsics to send or receive data. The TCPSTAT intrinsic allows the user to check the status of a connection.

Within the user process, connections are identified through the combination of a connection file number and a connection buffer. The connection file number is returned by a successful TCPOPEN call. The connection buffer is an integer array allocated by the user process. The buffer is initialized by the TCPOPEN intrinsic and is then passed as the first parameter to all of the other TCP intrinsics. It is the responsibility of the user process to maintain the association between the connection file number and the connection buffer.

The TCP I/O interface is entirely asynchronous so that a user process can queue any number of read or write requests to a particular connection. The user process has three limitations in this regard: first, it must provide the buffers associated with each TCP intrinsic call; second, the user process must keep track of which buffers are associated with each I/O call; and third, the user process cannot dequeue buffers until it has permission to do so from the system protocol process.
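
The bookkeeping that these limitations place on the user process can be sketched as follows. This is only an illustration in Python, not part of the design: the class ConnectionTable and all of its method names are hypothetical, the connection buffer is modelled as an ordinary list, and buffers are tracked by object identity rather than by stack-relative address.

```python
class ConnectionTable:
    """User-side bookkeeping for asynchronous TCP I/O (hypothetical names)."""

    def __init__(self):
        self._cbuf_by_file = {}   # connection file number -> connection buffer
        self._queued = {}         # connection file number -> buffers held by the TCP

    def register(self, file_number, connection_buffer):
        # After a successful open, remember which connection buffer belongs
        # to the file number that the open returned.
        self._cbuf_by_file[file_number] = connection_buffer
        self._queued[file_number] = []

    def connection_buffer(self, file_number):
        return self._cbuf_by_file[file_number]

    def queue(self, file_number, buffer):
        # The user process supplies the buffer and records that it is now
        # tied to an outstanding I/O request on this connection.
        self._queued[file_number].append(buffer)

    def release(self, file_number, released):
        # Buffers leave the queue only when the protocol process has
        # released them; anything else is a bookkeeping error.
        held = self._queued[file_number]
        for buf in released:
            for i, candidate in enumerate(held):
                if candidate is buf:
                    del held[i]
                    break
            else:
                raise ValueError("buffer was never queued on this connection")
        return released

    def outstanding(self, file_number):
        return list(self._queued[file_number])
```

A user process would call register once per opened connection, queue for each send or receive request it issues, and release only with the buffer list handed back by the protocol process.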
The user process uses a combination of the IOWAIT and TCPWAIT intrinsic calls to keep track of I/O completion events. The IOWAIT intrinsic is invoked when the user process has completed processing all of the current data. When the IOWAIT intrinsic returns with a file number associated with a TCP connection, the TCPWAIT intrinsic is called to handle the I/O completion. The TCPWAIT intrinsic uses the connection buffer to determine the cause of the I/O completion and then performs the appropriate I/O cleanup task and returns an I/O type code to the user process.

The specific calling sequences of the TCP intrinsics are given below:

TCPOPEN(TCPCBUF,FHIA,FP,A/P,LP[,BADDR])   opens a TCP connection

TCPCBUF - TCP Connection Buffer - This is a pointer to an integer array ten elements long which acts as the control structure for all network connections. The array is allocated by the user process before any TCP intrinsics are called.

FHIA - Foreign Host Internet Address - 32 bit address. This address may be obtained with the HOSTADDR intrinsic which takes the host name text string as a parameter and returns the 32 bit internet address. In the case of a passive open a zero address indicates a listen for any host.

FP - Foreign Port - a 16 bit port address for this connection at the foreign host. In the case of a passive open a 0 port indicates a listen from any port on a foreign host.

A/P - Active/Passive - a boolean flag used to indicate if this open is for a listen socket (passive) or for an active connection.

LP - Local Port - 16 bit local port id. This parameter is optional. If it is not specified, the TCP picks a free local port id from a reserved part of the name space.

BADDR - Buffer Address - an optional buffer used to give the foreign host a window for transmission.
If the buffer is not provided, the connection is opened with a zero window size until the user process calls the TCPRECEIVE intrinsic.

returns - local connection name or error code of -1 if the connection failed. The local connection name is actually the file number of the private message file associated with this connection.

TCPCLOSE(TCPCBUF)   closes a TCP connection

TCPCBUF - TCP Connection Buffer - same as in the TCPOPEN intrinsic.

TCPABORT(TCPCBUF,BUFPTR)   aborts a TCP connection

TCPCBUF - TCP Connection Buffer - same as in the TCPOPEN intrinsic.

BUFPTR - Buffer Pointer - pointer to a list of buffers released by the TCPABORT call. A zero value indicates that no buffers were released.

TCPRECEIVE(TCPCBUF,BADDR,BCNT)   reads data from a TCP connection

TCPCBUF - TCP Connection Buffer - same as in the TCPOPEN intrinsic.

BADDR - Buffer Address - address of user buffer for receiving network data.

BCNT - Byte Count - number of bytes to be transferred.

returns - an error code of -1 if the TCPRECEIVE failed.

TCPSEND(TCPCBUF,BADDR,BCNT,EOL)   writes data to a TCP connection

TCPCBUF - TCP Connection Buffer - same as in the TCPOPEN intrinsic.

BADDR - Buffer Address - address of user buffer for sending network data.

BCNT - Byte Count - number of bytes to be transferred.

EOL - End Of Letter - a boolean flag to indicate that this buffer is an end of letter.

TCPSTAT(TCPCBUF,SBADDR)   gives TCP connection status

TCPCBUF - TCP Connection Buffer - same as in the TCPOPEN intrinsic.

SBADDR - Status Buffer Address - address of user buffer for receiving the connection status information.

TCPWAIT(TCPCBUF,BUFPTR,DATAPTR)   returns the result of a previous TCP intrinsic call

TCPCBUF - TCP Connection Buffer - same as in the TCPOPEN intrinsic.

BUFPTR - Buffer Pointer - used to return a pointer to a buffer list released by an I/O event.
A zero pointer indicates that no buffers were released.

DATAPTR - Data Pointer - pointer to the first data element within a buffer returned by TCPWAIT in response to an earlier TCPRECEIVE call.

returns - a code indicating the type of I/O completed. A list of the I/O codes is given in Appendix D.

6.2 Flow Control Across the Interface

Flow control across the user process/TCP interface is implemented through the use of message files. The message files act as control channels to transmit flow control and status messages between the user process and the TCP.

Each connection makes use of two message files. A general input message file is used to transmit control messages from user processes to the TCP. All user processes share the same general input message file. Each connection also uses a private message file to convey control and status information from the system protocol process to the user process.

The control messages passed between the user process and the system protocol process are short data buffers. These buffers contain the message type along with other information associated with the particular command. The format of each message type is shown in Appendix D.

6.3 Interface Control Structures

Each network connection has an associated TCP interface control block. These blocks include a set of pointers and data segment numbers used to keep track of buffers within both the user process and the system protocol process. The pointers contain buffer addresses relative to the beginning of the stack data segment for each process. A diagram of the TCP interface control block is given in Appendix E.

The control blocks are maintained in a separate data segment shared by both the user and system protocol processes. The data segment is initialized by the system protocol process when it starts up.
The initialization of the data segment divides it into a number of control blocks. Individual control blocks are initialized by the TCPOPEN intrinsic. Responsibility for releasing the control blocks is shared among the TCPCLOSE, TCPABORT, and TCPWAIT intrinsics.

6.4 Interface Control Algorithms

The specific functions performed by each of the network I/O intrinsics are as follows:

TCPOPEN

1. The TCPOPEN intrinsic software searches for a free TCP connection interface control block and initializes it.

2. The TCPOPEN software creates a private message file with a unique file name. The unique file name is formed out of the prefix "TCP" and the id number of the TCP interface control block.

3. The TCPOPEN software sends an OPEN CONNECTION command message to the TCP via the general input message file. The message includes all of the TCPOPEN parameters and the id number of the TCP interface control block.

4. The TCPOPEN software makes a read request with timeout on the private message file. If the read times out, the TCPOPEN software sends an ABORT CONNECTION command to the TCP, deletes the TCP interface control block, and returns an error code to the user process. The connection buffer provided as a parameter to TCPOPEN is used as the read buffer.

5. The TCP software reads the open command from the general input message file and uses the information to create a connection control block. The TCP software also initiates the connection protocols specified in the command message.

6. The TCP software sends an OPEN CONFIRM message back to the user process via the private message file created by the TCPOPEN intrinsic software. The OPEN CONFIRM message will fail if the user process is destroyed by a user abort. If this occurs, the TCP software takes responsibility for cleaning up the TCP interface control block as well as the connection control blocks.
7. The TCPOPEN software reads the OPEN CONFIRM message from the private message file. The TCPOPEN software initiates a read without wait on the private message file. The connection buffer is again used as the read buffer.

8. If the user provides a read buffer as the last parameter to the TCPOPEN intrinsic, a read operation is initiated. The TCPOPEN software attaches the buffer to the TCP interface control structure and sends a RECEIVE message to the TCP via the general input message file. The TCP uses this message to set the size of the connection window.

9. The TCPOPEN software returns the file number of the private message file to the user process.

TCPCLOSE

1. The TCPCLOSE software marks the connection closed bit for the send side in the TCP interface control block. The TCPCLOSE software checks to see if there are any data buffers waiting to be read by the TCP. If there are no data buffers, the TCPCLOSE software sends a CLOSE CONNECTION command to the TCP. If the receive side is marked closed and there are no buffers waiting to be read, the TCPCLOSE intrinsic software deletes the TCP interface connection control block.

2. The TCPCLOSE software returns to the user process.

3. When the TCP receives a connection close command from the user process it sends a FIN to the foreign host and marks the send side of the connection as FINWAIT-1. When the TCP receives an ACK of the close from the foreign host, it marks the send side of the connection as FINWAIT-2. If the receive side of the connection is marked closed, the TCP deletes the connection control block.

4. If the TCP receives a FIN from the foreign host, it marks the receive side of its connection as closing. When all data and the FIN sent by the foreign host are ACKED, the TCP sends a NETCLOSE command to the user process and marks the receive side of the connection closed.
If the send side is also marked as closed, the TCP deletes the connection control block. The close message sent to the user process is processed by the TCPWAIT intrinsic.

TCPABORT

1. The TCPABORT software sends an ABORT CONNECTION command to the TCP via the general input message file. The TCPABORT software releases the TCP interface control block and deletes the connection's private message file.

2. The TCPABORT software returns to the user process. The return includes a pointer to a list of user buffers which were assigned to the connection.

3. When the TCP receives an ABORT CONNECTION command from the user process it sends a reset to the foreign host, deletes any unacknowledged data it has for this connection, and deletes the connection control block.

4. If the TCP receives a reset from the foreign host, it deletes all of the data waiting to be transmitted to the user process and sends a NETABORT message to the user process via the private message file. The NETABORT message is handled by the TCPWAIT intrinsic.

TCPRECEIVE

1. The TCPRECEIVE software checks to see if the receive side connection closed flag is set. If the flag is set, the TCPRECEIVE software returns a connection closed indication to the user process. It is up to the user process to close the send side of the connection and clean up the connection buffer if it has not done so.

2. If the connection is open, the TCPRECEIVE software attaches its read buffer to the TCP interface control block and sends a RECEIVE message to the TCP. The message is used to indicate to the TCP that the user has made a buffer available to the connection. The TCPRECEIVE software returns to the user process.

3. When the TCP receives the user's read message, it checks to see if it has any unacknowledged segments waiting to be transferred to the user process.
If it has no segments, it uses the RECEIVE message to increase its receive window size. If the TCP has segments waiting for transfer, it transfers as much of the data as possible to the user process. All transferred data is immediately acknowledged to the foreign host. The TCP sends a PENDING RECEIVE message to the user process to advise it of the transfer of data. This message is processed by the TCPWAIT intrinsic.

4. If the TCP receives data from the foreign host, it checks to see if the user process has assigned any free buffers to this connection. If there are free buffers, the TCP copies as much of the data it receives as possible into the user buffers and acknowledges the copied data to the foreign host. Any data which is not copied is maintained on the TCP aging list where it is stored until it is transferred to a user process buffer or discarded. The user process is informed of the data transfer through a PENDING RECEIVE command message via the private message file. This message is received by the TCPWAIT intrinsic.

TCPSEND

1. The TCPSEND intrinsic checks to see if the connection is still open. If the connection is marked closed, the TCPSEND returns an error code to the user.

2. If the connection is still open, the intrinsic software attaches the user supplied data buffer to the TCP interface control block. The TCPSEND software sends a SEND message to the TCP via the general input message file. The TCPSEND software now returns to the user process.

3. The TCP software, on receiving the data SEND message, checks to see if it can send the data to the foreign host. The decision on whether to send the data is made
by checking the following conditions:

- the foreign host has advertised sufficient window space,

- the number of outstanding RFNMs for all connections to the foreign host is less than eight,

- the amount of data waiting to be sent is sufficient to warrant a data packet. This condition prevents single byte segments from being sent out over the ARPANET. The TCP waits until it has at least 10 bytes of data before transmitting it out to the ARPANET.

- the user has specified an EOL.

If the TCP decides to send the data, it prepares a network packet and copies as much of the user data as it can transmit into the network packet. The data transfer is made directly from the list of user buffers queued by the TCPSEND intrinsic to the message packet buffer. All buffers filled by the data transfer are marked as filled and appended to the filled buffer list.

4. After the TCP has transferred all of the data from the user buffers, it checks the TCP interface control block. If the send side of the connection is marked closed, the TCP sends a FIN to the foreign host. If the receive side is also closed, the TCP sends a NETCLOSE command to the user process.

5. After the data is transmitted, the TCP sets a retransmission timer.

6. If the TCP receives an acknowledgment from the foreign host, it updates the TCP interface control block to reflect the acknowledgment, turns off the timer, and sends a DATA SENT message to the user process via the connection's private message file. The message contains the number of bytes acknowledged. This message is processed by the TCPWAIT intrinsic. If only some of the data is acknowledged, the TCP resets the timer for the unacknowledged data.

7. If the TCP does not get an acknowledgment from the foreign host and the connection times out, it again reads as much data as it can from the user buffer and sends it out as a network packet.

TCPSTATUS

1.
The TCPSTATUS software checks to see if the connection is still open. If it is closed, it returns a connection closed code to the user process.

2. The TCPSTATUS command checks to see if there is an outstanding status request by the user process. If there is, it returns an error code to the user process.

3. If there is no pending status request, the TCPSTATUS software attaches the status request buffer to the TCP interface control block and sends a STATUS message to the TCP via the general input message file. The TCPSTATUS returns to the user process.

4. When the TCP receives the status request message, it formulates a status message and copies it into the user's status buffer attached to the connection buffer. The TCP then sends a status complete message to the user process via the connection private message file. The message from the TCP is processed by the TCPWAIT intrinsic.

TCPWAIT

1. The TCPWAIT software checks the message received from the TCP.

2. If the message is a NETCLOSE command, the TCPWAIT software checks if the send side of the connection is closed and there is no data waiting to be sent to the TCP. If the send side is closed and there is no pending TCP data, the TCPWAIT software deletes the TCP interface control block. If there is data waiting to be transmitted, the TCPWAIT software marks the receive side of the connection closed. In either case, TCPWAIT returns a connection closed code to the user process. It is up to the user process to decide when to close the send side of the connection, if it has not already done so. If there are any user buffers still assigned to this connection, they are returned to the user process at this time.

3. If the message is a NETABORT command the TCPWAIT software deletes the TCP interface control block and returns a connection abort code to the user process.
Any buffers associated with the connection are also returned in a list structure through the buffer pointer parameter.

4. If the message is a PENDING RECEIVE command, the TCPWAIT returns the pointer to the head of the first data buffer, the first data byte, and a byte count. Since the data may be returned in a number of linked buffers, it is up to the user to follow the buffer links. As the user process reads the data it should check each buffer's header. Completely filled buffers marked with a zero in the in use field can be reclaimed by the user process.

5. If the message is a DATA SENT message, the TCPWAIT software checks the acknowledgment count and releases as many buffers as it can from the send buffer list. The released buffers are linked in a list and the buffer pointer parameter is set to point to the first buffer in the list. The TCPWAIT software returns a data acknowledgment code to the user process.

6. If the message is a STATUS COMPLETE message, the TCPWAIT software sets the buffer pointer parameter to point to the status buffer and returns a status complete command code to the user process.

6.5 Windowing, Acknowledgment, and Retransmission

The receive window size and data segment acknowledgment are completely dependent on the number of buffers the user process allocates to a connection. The receive window size of a connection is always set to the amount of free buffer space the user process allocates to the receive side of a connection. Acknowledgments of incoming TCP segments are limited to those sequence numbers which fit in the receive window. Acknowledgments are sent as soon as data is copied from the system protocol buffers to the user protocol buffers.

The initial retransmission algorithm is extremely simple. The first retransmission of unacknowledged data occurs 3 seconds after the original transmission.
The second retransmission occurs 6 seconds after the first. The third and successive retransmissions occur 15 seconds after the previous retransmissions.

7 1822 Layer/INP Driver Communication

Communication between the system protocol process and the INP driver is implemented with four intrinsics: IOPEN, ICLOSE, IREAD, and IWRITE. These intrinsics are modified forms of the CS/3000 intrinsics. Their function is to open a connection to the INP network processor and to transmit data buffers to and from the INP. The IREAD and IWRITE intrinsics are always done without wait. The IOWAIT intrinsic is used to determine the completion of an I/O request.

Initialization of the INP interface begins with an IOPEN call which initializes the interface software. This is followed by four IREAD intrinsic calls to initialize buffers for incoming network packets. Four pending buffers should allow enough buffering to catch all of the incoming data without tying up too many network buffers.

The following is a summary of the commands used to communicate between the protocol process and the INP driver.

- IOPEN()

returns error code on failure. Possible failure modes include failure to find the INP microcode file, failure to load the microcode file in the INP, and a hardware failure in the INP.

This command initializes the connection between the protocol process and the INP. The initialization includes activating the INP and loading its microcode.

- ICLOSE()

This command closes the connection between the INP and the protocol process when the network software is brought down.

- IREAD(buffer)

This intrinsic passes an empty buffer to the INP driver. The buffer is queued to a DIT with an ATTACHIO command. Control then returns to the protocol process.

- IWRITE(buffer)

This intrinsic passes a full buffer to the INP DIT with the ATTACHIO command.
Control is returned after the buffer is attached to the DIT. The buffer is released when the calling process receives an interrupt indicating I/O completion.

8 Protocol Software Buffering Scheme

Data buffer management is the most important component of the network protocol software. Data buffers perform the key functions of data storage and data communication within the protocol software. These functions have complex and conflicting requirements which must be balanced by the buffer management scheme. An understanding of the buffer management scheme therefore begins with an understanding of its requirements.

First, data buffers must be considered a scarce resource shared by a number of competing "interests" within the protocol software. These "interests" include the various protocol layers as well as individual network connections within the TCP layer. The major problem is how to effectively allocate buffer resources among these interests. This problem becomes particularly difficult when there is a shortage of buffers.

An examination of the buffer requirements shows that they fall into two categories. The first category includes those buffers used to support general network functions. This includes buffers used in the 1822 and Internet protocol layers. These buffers are assigned to move and store data in these protocol layers without regard to particular network connections. The second category includes those buffers used by the TCP protocol to support specific connections.

The distinction between the two buffer categories is important because buffer use within each category is controlled by a different set of events. The use of buffers assigned to the general network functions can be controlled by the system protocol software.
Buffers are processed through the Internet and 1822 protocol layers without regard to the behavior of user processes and their effect on individual connections. Buffers assigned to the connection specific network functions in the TCP and higher level protocol layers are greatly affected by events which occur in user processes. The rate at which data is accepted from or transmitted to the ARPANET by user processes is totally unpredictable. This unpredictability makes it difficult for the system protocol process to effectively assign buffer resources to individual network connections.

Two buffer pools are used to separate those buffering functions shared by all network connections from the connection specific buffering functions. A network buffer pool, maintained by the system protocol process, is used to support the 1822 and Internet and some TCP buffering functions. A user buffer pool, maintained by each user protocol process, is used to support connection dependent buffering functions for the TCP and higher level protocols.

8.1 Network Buffer Pool

The network buffer pool supports the following specific functions.

- movement of network packets from the INP driver through the 1822 and Internet protocol layers;

- storage of Internet datagram fragments in the Internet protocol layer;

- storage of unacknowledged TCP segments which do not fall into the current window;

- movement of network packets from the TCP layer through the Internet and 1822 layer to the INP driver.

The network buffer pool is maintained on the system protocol process stack where it can be accessed easily by the various system protocol layers. All of the buffers in the pool are the same size to minimize the amount of software overhead needed to maintain the buffers. The buffer size is matched to the maximum frame size (128 bytes) which may be transmitted over the X.25 link between the INP and the ARPANET IMP.
The size choice is the result of two constraints. First, the buffers used to catch incoming data must be large enough for the largest incoming network packet. The packets are transferred directly into memory by the INP hardware, making it impossible for a packet to cross buffer boundaries. Second, the single size buffer scheme limits the amount of software overhead needed to maintain the buffer pool.

The single size buffer scheme does not waste buffer space because the buffer size is well matched with the data it processes. The 128 byte buffer size allows room for all of the protocol headers and a small amount of data. Messages with more data will use multiple buffers. The buffers are large enough to hold a significant amount of data yet small enough to limit the waste caused by partially filled buffers.

No attempt is made to assign network buffers to any particular protocol layer or task. Buffers are allocated either when data is read from the ARPANET or when the TCP layer sends data out to the ARPANET.

8.1.1 Packet Compaction

When the total number of network buffers in the free list falls below a set value, a data compaction algorithm is invoked. This algorithm searches for partially filled buffers used to store Internet datagram fragments and unacknowledged TCP segments waiting to be transferred to a user process. These buffers are chosen because processing of the data in them is indefinitely suspended. Compaction of the data in these buffers allows some of the buffers to be released to the free list.

8.1.2 Buffer Recycling

A buffer recycling algorithm is invoked when the system protocol process runs out of free network buffers. The algorithm allows buffers to be reused even if they currently contain data. The mechanism works by identifying which data buffers can be reused without losing irreplaceable information.
These buffers are sorted in a priority scheme which allows the least important buffers to be recycled first. The buffer recycling scheme prevents one socket from tying up too much of the network buffer resources. It also helps assure a supply of network buffers even under heavy load conditions.

The buffer recycling scheme divides network buffers into three categories: free buffers, in-use buffers, and aging buffers. Free buffers are available for immediate use by any protocol layer and are maintained on a common free list. In-use buffers are buffers bound to messages currently being processed and cannot be used for any other purpose. Aging buffers are used in messages where processing is suspended for any number of reasons. These buffers are placed in one of two special lists arranged in order of decreasing age. That is, message buffers which have been blocked for the longest time are at the front of the queue, while the message buffers which were most recently blocked are at the end of the queue.

There are two points in the protocol software where messages may be blocked. The first point is in the Internet Protocol software. Fragmented datagrams cannot be passed on to the TCP and can be blocked indefinitely if one or more of the fragments which make up the datagram is lost. A duplicate datagram may eventually be transmitted leaving the fragmented datagram in a suspended state.

The second blocking point is in the TCP software. Unacknowledged segments sent by a foreign host remain suspended in the TCP until they are transferred to a user process buffer. Any segments which are not transferred to a user process will remain blocked indefinitely.

Buffer recycling is implemented through buffer aging lists which are adjuncts to the buffer free list. When an incoming message is blocked, its buffers are attached to the end of one of two aging lists.
Buffers bound to datagram fragments are attached to one aging list while buffers bound to TCP segments waiting to be read by user processes are attached to the second aging list.

The aging lists are periodically manipulated when a new datagram fragment comes in or when a user process receives some data from the TCP. Buffers associated with the particular datagram fragments or TCP segments are moved to the end of their respective aging lists. This helps assure that any data which has a chance of being used will not be thrown away.

The buffers bound to fragmented datagrams are recycled first because they are the most expendable. Blocked datagram buffers may be a part of datagrams which have been retransmitted and passed on to the TCP. When the blocked datagram buffers are exhausted the buffers bound to blocked TCP segments are used. These buffers contain the unacknowledged segments which have not been claimed by a user process. The assumptions here are that the user process will never claim these segments and that they are expendable.

8.2 User Process Buffer Pool

The user process is responsible for maintaining a set of fixed length buffers for passing the user data to the TCP. Each buffer must include a four byte header along with 80 bytes of data space.

The first element of the header is used as either a byte count or a full buffer marker. The count is used by the TCPSEND intrinsic to indicate the number of data bytes in the buffer. The TCPRECEIVE intrinsic uses the buffer full marker to identify buffers which may be reclaimed by the user process.

The second element in the array header contains a list pointer. This pointer is maintained by the intrinsic software and should not be altered by the user process until the buffer is released.
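
The user buffer layout described above can be illustrated with a short Python sketch. Several details here are assumptions made only for the example: the four byte header is taken to be two 16-bit elements, the list pointer is modelled as a 1-based index into a Python list rather than a stack-relative address, zero is used as the end-of-list marker, and the function names are hypothetical.

```python
import struct

DATA_SPACE = 80                 # bytes of data space per buffer (from the text)
HEADER = struct.Struct(">HH")   # assumed: two 16-bit elements = four byte header
NO_LINK = 0                     # assumed end-of-list marker

def build_send_chain(payload):
    """Split payload (bytes) into linked, fixed length send buffers.

    Header element 1 holds the byte count; element 2 holds the 1-based
    index of the next buffer in the chain, or NO_LINK for the last one.
    """
    chunks = [payload[i:i + DATA_SPACE]
              for i in range(0, len(payload), DATA_SPACE)]
    buffers = []
    for n, chunk in enumerate(chunks, start=1):
        next_link = n + 1 if n < len(chunks) else NO_LINK
        body = chunk.ljust(DATA_SPACE, b"\0")   # pad to the fixed length
        buffers.append(HEADER.pack(len(chunk), next_link) + body)
    return buffers

def read_chain(buffers):
    """Follow the links from the first buffer and reassemble the data."""
    data = b""
    link = 1 if buffers else NO_LINK
    while link != NO_LINK:
        buf = buffers[link - 1]
        count, next_link = HEADER.unpack_from(buf)
        data += buf[HEADER.size:HEADER.size + count]
        link = next_link
    return data
```

For example, a 200 byte payload occupies three 84 byte buffers whose counts are 80, 80, and 40. Fixed length buffers keep the pool simple at the cost of some unused space in the last buffer of each chain.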

9 Data Flow Through the Protocol Software

Data flow through the protocol software is effected through a series
of tests and function calls. The tests check the type and processing
state of each packet while the function calls perform specific
operations on each packet. These operations include such things as
creating or checking headers and queueing or de-queueing packet
buffers.

Whenever possible, network packets are processed through all of the
system protocol layers without interruption. This helps increase
throughput by minimizing two important parameters. First, the amount
of buffering required to process data is decreased because all
network buffers associated with a packet are released when the packet
has passed through the protocol software. Second, the time between
the receipt of a packet from the ARPANET and the transmission of an
ACK is reduced.

There are a number of instances when the processing of a packet can
be interrupted within the system protocol process. This can occur
when the lack of some resource or event prevents further processing.
Examples of this are as follows:

   - Internet datagram fragments waiting for reassembly;

   - TCP segments from a foreign host waiting to be read by a user
     process;

   - TCP segments from a user process waiting for window allocation
     before being transmitted to the ARPANET;

   - TCP segments from a user process already sent to the ARPANET
     but waiting for an acknowledgment.

9.1 ARPANET to the User Level Data Flow

Data packets come in from the network via a DMA interface to the INP
network processor. Incoming data is first transferred into the
protocol process via network buffers passed to the IREAD intrinsic,
which places a read request on the DIT queue of the INP. An arriving
network packet is placed in the network buffer by the INP driver.
The system protocol process is notified of each I/O completion
through the IOWAIT intrinsic.

Processing of network packets begins when an IOWAIT call returns on
completion of an IREAD intrinsic. The first processing step is to
link the network buffers which contain the pieces of an 1822 packet.
The next processing steps are performed by the 1822 protocol
software. If this is a normal data packet, the 1822 header is removed
and the data packet is passed as a datagram to the Internet software.
The transfer is done by calling a sequence of Internet protocol
routines with the datagram as a parameter.

The Internet software checks the datagram header for integrity and
then tries to find the proper address for this datagram. If the
datagram is not for the local host it is routed to the proper ARPANET
Host and the network buffers are returned to the free list. If the
datagram is a fragment of a larger datagram it is linked to any
existing fragments waiting to be processed. If the new fragment does
not complete the incoming datagram, the fragment is placed in an
aging buffer queue next to the youngest buffer in the partially
complete datagram. At this point all processing on the incoming
datagram is suspended until the rest of the datagram fragments
arrive. A complete datagram is stripped of its Internet header and
sent to the TCP software as a data segment.

The TCP performs a number of functions on incoming segments. First,
the segment header is checked to see if it belongs to a known socket.
If it does, any acknowledgment information from the header is taken
to update the socket status. Next, the segment is checked to see if
it falls within a window. If it is not within the window (or a
reasonable approximation thereof), the segment is discarded and its
buffers are returned to the free list.

Accepted TCP segments are transferred into the user buffers. The
transfer is initiated by the user process, which provides a buffer
through the TCPOPEN or TCPRECEIVE intrinsic. A command message sent
via the general input message file is used to inform the system
protocol process that a buffer is available. The system protocol
process transfers as much of its segment as possible to the user
buffer. The user process is then notified of the data transfer via
the connection's private message file. Only the transferred portions
of the segments are acknowledged to the foreign host. Any portions
of segments which do not fit in the receive window are stored in the
TCP aging queue.

The acknowledgment may be sent in a number of ways. If the same
network connection has an outgoing packet waiting for transmission,
the acknowledgment information is added to the outgoing packet. If
there is no pending outgoing packet, a check is made to see if there
is sufficient unacknowledged data to warrant an acknowledgment
packet. If there is enough information, a separate acknowledgment
packet is generated and transmitted out to the ARPANET as if it were
a normal message. If the number of unacknowledged segments is
insufficient to justify an acknowledgment packet, the pending
acknowledgment bit in the TCB is set and a timer is started. If the
timer runs out, an acknowledgment packet is sent regardless of the
number of unacknowledged segments.

9.2 User Level to the ARPANET Data Flow

Transfer of data from the user process out to the ARPANET begins with
a NETSEND intrinsic call. The intrinsic software sends a message to
the system protocol process to inform it that it has data to send.
The system protocol process tests the state of the connection to see
if data transmission is feasible.
The following are sufficient conditions for data transmission out to
the ARPANET:

   - enough data has collected to justify transmitting it to the
     foreign host;

   - the user process has specified an EOL in the data transmission;

   - there are fewer than eight outstanding 1822 protocol packets
     waiting for RFNMs to the foreign host;

   - the waiting data falls within the foreign host's window.

If the state of the connection does not allow a transmission to
occur, a request-to-send data flag is set in the connection control
block. When the connection state changes due to some external event,
a check is made to see if the new state allows the transmission of
waiting data. An example of such an event is the arrival of a RFNM
from a foreign host; in this case all of the connections to the
foreign host are checked for data waiting for transmission. The
connection with data which has been waiting for the longest time is
processed first. An attempt is made to combine as many of the
waiting TCP segments as possible into one data transfer to increase
the amount of data transmitted.

If there is nothing blocking transmission of the data, the TCP
software allocates a buffer, creates the necessary TCP, Internet, and
1822 headers, and copies the data to be transmitted from the user
buffer to the system's buffer. The TCP header will include any
acknowledgment information for data received on the return socket
associated with the connection.

In order to assure proper transmission of the TCP segment a
retransmission sequence is started. A retransmission timer is
started to wake up the protocol software when a retransmission is
needed. If a timeout occurs, the segment is retransmitted as soon as
the state of the connection allows it. The necessary conditions for
a retransmission are the same as those for the original transmission.
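
The transmission test and the request-to-send flag described above
might be sketched as follows. This is a sketch under stated
assumptions, not the design itself. The text lists the conditions
without saying how they combine; the sketch assumes that a trigger
(enough data, or an EOL) and both flow control conditions (RFNM count
and window) must hold together. The names ConnectionState,
may_transmit, and try_send, and the MIN_SEND_BYTES threshold, are
hypothetical, and the window test is simplified to a byte count
comparison.

```python
from dataclasses import dataclass

MAX_OUTSTANDING_1822 = 8  # from the text: fewer than eight awaiting RFNMs
MIN_SEND_BYTES = 64       # hypothetical threshold for "enough data"


@dataclass
class ConnectionState:
    waiting_bytes: int             # user data queued for transmission
    eol_requested: bool            # user process specified an EOL
    outstanding_rfnms: int         # 1822 packets still awaiting RFNMs
    send_window: int               # bytes the foreign host will accept
    request_to_send: bool = False  # flag kept in the connection control block


def may_transmit(c):
    """Assumed combination of the conditions listed in the text."""
    worth_sending = c.waiting_bytes >= MIN_SEND_BYTES or c.eol_requested
    host_ready = c.outstanding_rfnms < MAX_OUTSTANDING_1822
    in_window = 0 < c.waiting_bytes <= c.send_window
    return worth_sending and host_ready and in_window


def try_send(c):
    """Send if allowed; otherwise remember that data is waiting."""
    if may_transmit(c):
        c.request_to_send = False
        return True
    c.request_to_send = True
    return False
```

Under these assumptions, an external event such as the arrival of a
RFNM would decrement outstanding_rfnms and call try_send() again for
every connection to that host whose request_to_send flag is set.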
If the segment is partially acknowledged, the data left in the
retransmission queue is only that data represented by the
unacknowledged sequence numbers.


APPENDIX A - HP3000 to ARPANET Link

          +----------+                      +----------+
          |          |---+              +---|          |
          |          | I |   X.25 LAP   |   |          |
          |  HP3000  | N |--------------|   | C30 IMP  |
          |          | P |              |   |          |
          |          |---+              +---|          |
          +----------+                      +----------+


APPENDIX B - Protocol Software Organization

                           +---------+
                           |   MBB   |
                           +---+-+---+
                               ^ |
                               | v
                           +---+-+---+
                           |   INP   |
                           +---------+
                           |  Driver |
                           +---+-+---+
                               ^ |
                               | v
                          ,----+-+----,     High Priority
                          /  Device   /     User Mode
                          /Information/     Process
                          /   Table   /
                          '----+-+----'
                               ^ |
                     ATTACHIO  | v
                         +-----+-+----+
                         |    1822    |
                         |  -------   |           ,---------------,
                         |  Internet  |--------->/ Transmission  /
                         |  -------   |<--------/ Control Block /
                         |    TCP     |        '---------------'
                         +-+--+---+--++
                           ^  |   |  |
                           :  |   |  |
                  +--------:--+   |  +------------+
                  |        :      |               |
                  |     ...:......|...............|....
                  |     :         |      :        |   :
                  v     :         v      :        v   :
              +---+-----+---+  +--+------+-+   +--+---+--+
              |Server Telnet|  |User Telnet|   |User FTP |
              |   Program   |  |  Program  |   | Program |
              +-----+--+----+  +--+-+-+-+--+   +-+-+-+-+-+
                    ^  |          | | | |        | | | |
                    |  v          | | | |        | | | |
Pseudo-TTY        ,-+--+-,         USERS          USERS
Logical Devices  / PTY  /            (one each user)
                '-+-+--'
                  ^ |
                  | v
      HP3000 Command Interpreter

              ---- Private Message Files
              .... General Input Message File


APPENDIX C - Control Structures

          _______________________________
          |POINTER TO BUFFER FREELIST   |
          |-----------------------------|
          |POINTER TO END OF FREELIST   |
          |-----------------------------|
          |FREE BUFFER COUNT            |
          |-----------------------------|
          |POINTER TO INTERNET AGE LIST |
          |-----------------------------|
          |POINTER TO END OF INTERNET   |
          |-----------------------------|
          |INTERNET AGE LIST COUNT      |
          |-----------------------------|
          |POINTER TO TCP AGE LIST      |
          |-----------------------------|
          |POINTER TO END OF TCP LIST   |
          |-----------------------------|
          |TCP AGE LIST BUFFER COUNT    |
          |_____________________________|

          NETWORK RESOURCE CONTROL BLOCK


          _______________________________
          |HOST NUMBER                  |
          |-----------------------------|
          |NUMBER OF OUTSTANDING 1822   |
          |PACKETS WAITING FOR RFNMS    |
          |_____________________________|

          FOREIGN HOST CONTROL BLOCK


          _____________________
          | CONNECTION STATE  |
          |-------------------|
          | LOCAL PORT NUMBER |
          |-------------------|
          | TCP INTERFACE     |
          | CONTROL BLOCK NO. |
          |-------------------|
          |CONNECTION PRIVATE |
          |MESSAGE FILE ID    |
          |___________________|

          GENERAL INFORMATION SECTION OF THE
          CONNECTION CONTROL BLOCK


          _____________________
          |RECEIVE SEQUENCE   |
          |-------------------|
          |RECEIVE WINDOW     |
          |-------------------|
          |RECEIVE BUFF SIZE  |
          |-------------------|
          |RECEIVE URGENT PTR |
          |-------------------|
          |RECEIVE LAST BUFF  |
          |-------------------|
          |INITIAL RECEIVE    |
          |SEQUENCE NUMBER    |
          |-------------------|
          |PTR TO UN-ACKED TCP|
          | SEGMENTS          |
          |___________________|

          CONNECTION RECEIVE DATA


          _____________________
          |SEND UN-ACKED      |
          |-------------------|
          |SEND SEQUENCE      |
          |-------------------|
          |SEND WINDOW        |
          |-------------------|
          |SEND BUFFER SIZE   |
          |-------------------|
          |SEND URGENT PTR    |
          |-------------------|
          |SEND SEQUENCE FOR  |
          |LAST WINDOW UPDATE |
          |-------------------|
          |SEND LAST BUFFER   |
          |-------------------|
          |INITIAL SEND       |
          |SEQUENCE NUMBER    |
          |___________________|

          CONNECTION SEND DATA


                     ________________
FREEBUFFER QUEUE --->|NEXT BUFFER   |____
                     |--------------|    |
                     |              |    |
                     |______________|    |
                                         |
                   ______________________|
                  |
                  |  ________________
                  -->|NEXT BUFFER   |____
                     |--------------|    |
                     |              |    |
                     |              |    |
                     |______________|    |
                                         |
                   ______________________|
                  |
                  |  ________________
                  -->|NULL          |
                     |--------------|
                     |              |
                     |              |
                     |______________|

                  NETWORK BUFFER FREELIST


          OLDEST DATAGRAM           SECOND OLDEST
              FRAGMENT            DATAGRAM FRAGMENT
INTERNET  ________________         ________________
AGING --->|NEXT DATAGRAM |-------->|NEXT DATAGRAM |--> PTR TO
LIST      |--------------|         |--------------|    THIRD
          |NEXT BUFFER   |____     |NEXT BUFFER   |____ OLDEST
          |--------------|    |    |--------------|    |
          |______________|    |    |______________|    |
                              |                        |
        ______________________|  ______________________|
       |                        |
       |  ________________      |  ________________
       -->|NEXT BUFFER   |____  -->|NEXT BUFFER   |____
          |--------------|    |    |--------------|    |
          |              |    |    |              |    |
          |              |    |    |              |    |
          |______________|    |    |______________|    |
                              |                        |
        ______________________|  ______________________|
       |                        |
       |  ________________      |  ________________
       -->|NULL          |      -->|NULL          |
          |--------------|         |--------------|
          |              |         |              |
          |              |         |              |
          |______________|         |______________|

                      INTERNET AGING LIST


            CONNECTION 1             CONNECTION 2
          OLDEST UN-ACKED       SECOND OLDEST UN-ACKED
          SEGMENT BUFFERS          SEGMENT BUFFERS
TCP       ________________         ________________    PTR TO
AGING --->|NEXT SEGMENT  |-------->|NEXT SEGMENT  |--> THIRD
LIST      |--------------|         |--------------|    OLDEST
          |NEXT BUFFER   |____     |NEXT BUFFER   |____
          |--------------|    |    |--------------|    |
          |______________|    |    |______________|    |
                              |                        |
        ______________________|  ______________________|
       |                        |
       |  ________________      |  ________________
       -->|NEXT BUFFER   |____  -->|NEXT BUFFER   |____
          |--------------|    |    |--------------|    |
          |              |    |    |              |    |
          |              |    |    |              |    |
          |______________|    |    |______________|    |
                              |                        |
        ______________________|  ______________________|
       |                        |
       |  ________________      |  ________________
       -->|NULL          |      -->|NULL          |
          |--------------|         |--------------|
          |              |         |              |
          |              |         |              |
          |______________|         |______________|

                        TCP AGING LIST


APPENDIX D - Command Message Formats

General Message Format

          _________________________
          | Command Type (2 bytes)|
          |_______________________|
          | TCP Interface Control |
          | Block No. (2 bytes)   |
          |_______________________|
          | Data (10 bytes)       |
          |_______________________|

OPEN CONNECTION Data Area Format

          _________________________
          | Foreign Host Internet |
          | Address (4 bytes)     |
          |_______________________|
          | Foreign Port (2 bytes)|
          |_______________________|
          | Local Port (2 bytes)  |
          |_______________________|
          | Status Flag bits      |
          | (2 bytes)             |
          |_______________________|

SEND Command Data Area Format

          _________________________
          | Send Byte Count       |
          | (2 bytes)             |
          |_______________________|

Message File Command Codes

     ___________________________________________________
     |User To TCP Command | TCP to User Command | Code |
     |____________________|_____________________|______|
     | OPEN CONNECTION    | OPENCONFIRM         |  0   |
     |____________________|_____________________|______|
     | CLOSE CONNECTION   | NETCLOSE            |  1   |
     |____________________|_____________________|______|
     | ABORT CONNECTION   | NETABORT            |  2   |
     |____________________|_____________________|______|
     | SEND               | DATASENT            |  3   |
     |____________________|_____________________|______|
     | RECEIVE            | PENDING RECEIVE     |  4   |
     |____________________|_____________________|______|
     | STATUS             | STATUS COMPLETE     |  5   |
     |____________________|_____________________|______|


APPENDIX E - TCP Interface Control Block

     GENERAL INFO SECTION OF THE
     TCP INTERFACE CONNECTION BLOCK

     +----------------------------+
     | TCP STATUS BUFFER PTR      |
     +----------------------------+
     | USER PROCESS STACK DATA    |
     | SEGMENT NUMBER             |
     +----------------------------+
     | SEND Side Open/Close flag  |
     +----------------------------+
     |RECEIVE Side Open/Close flag|
     +----------------------------+


   SEND PORTION OF THE TCP                     USER BUFFERS USED
  INTERFACE CONNECTION BLOCK              TO TRANSMIT DATA TO THE TCP

+----------------------------+              +--------------+
| Ptr to First Free Buffer   |------------->| Data Count   |
| (user buffer whose data    |              |(no data bytes|
| has been read by TCP)      |              | in buffer)   |
+----------------------------+              +--------------+
| Ptr to Next Data Buffer    |              | Link to next |
| (user buffer whose data    +-----+     +--+ Buffer       |
| has not been read by TCP)  |     |     |  +--------------+
+----------------------------+     |     |  | DATA         |
| Ptr to first UnAcked byte  +---+ |     |  +--------------+
+----------------------------+   | |     |
| Offset in Next Data Buffer |   | |     |
|(offset in next data buffer +-+-+ |     |
| to first unread data byte) | | | |     |  +--------------+
+----------------------------+ | | +--->-+->| Data Count   |
                               | |          +--------------+
                               | |       +--| LINK         |
                               | |       |  +--------------+
                               | +-------|->|              |
                               +-------->|->| DATA         |
                                         |  +--------------+
                                         |
                                         |  +--------------+
                                         +->| Data Count   |
                                            +--------------+
                                            | LINK         |
                                            +--------------+
                                            | DATA         |
                                            +--------------+


  RECEIVE PORTION OF THE TCP                   USER BUFFERS USED
  INTERFACE CONNECTION BLOCK             TO TRANSMIT DATA FROM THE TCP

+----------------------------+              +--------------+
| Ptr to First Filled Buffer |------------->| Full/Filling |
| (user buffer which has been|              |True indicates|
| filled by TCP)             |              |buffer is full|
+----------------------------+              +--------------+
| Ptr to Next Data byte to be|           +--+ Link to next |
| read by user process       +-------+   |  | Buffer       |
+----------------------------+       |   |  +--------------+
| Ptr to First Partially Full|       |   |  | DATA         |
| Buffer (buffer not yet     +-----+ +->-+->|              |
| filled by TCP)             |     |     |  +--------------+
+----------------------------+     |     |
| Offset in Partially Full   |     |     |
| Buffer (next free byte for +--+  |     |
| TCP)                       |  |  |     |  +--------------+
+----------------------------+  |  +-----+->| Full/Filling |
                                |           +--------------+
                                |        +--| LINK         |
                                |        |  +--------------+
                                +------->+->| DATA         |
                                         |  +--------------+
                                         |
                                         |
                                         |
                                         |  +--------------+
                                         +->| Full/Filling |
                                            +--------------+
                                            | LINK         |
                                            +--------------+
                                            | DATA         |
                                            +--------------+
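

As an illustration of the command message layouts in Appendix D, the
following Python sketch packs a 14 byte message: a 2 byte command
type, a 2 byte TCP interface control block number, and a 10 byte data
area. The field sizes and the user-to-TCP command codes are taken
from Appendix D. The big-endian byte order, the zero padding of
unused data bytes, and the function names (pack_message, pack_open,
pack_send) are assumptions made for the example and are not stated
in this document.

```python
import struct

# User-to-TCP command codes from the Message File Command Codes table.
USER_TO_TCP = {
    "OPEN CONNECTION": 0,
    "CLOSE CONNECTION": 1,
    "ABORT CONNECTION": 2,
    "SEND": 3,
    "RECEIVE": 4,
    "STATUS": 5,
}

HEADER = struct.Struct(">HH")  # command type, interface control block no.
DATA_LEN = 10                  # size of the data area in bytes


def pack_message(command, icb_no, data=b""):
    """Build one command message; unused data bytes are zero (assumed)."""
    if len(data) > DATA_LEN:
        raise ValueError("data area is limited to 10 bytes")
    return HEADER.pack(USER_TO_TCP[command], icb_no) + data.ljust(DATA_LEN, b"\0")


def pack_open(icb_no, foreign_addr, foreign_port, local_port, flags=0):
    """OPEN CONNECTION data area: address(4), ports(2+2), status flags(2)."""
    data = struct.pack(">4sHHH", bytes(foreign_addr), foreign_port,
                       local_port, flags)
    return pack_message("OPEN CONNECTION", icb_no, data)


def pack_send(icb_no, byte_count):
    """SEND data area: a 2 byte send byte count."""
    return pack_message("SEND", icb_no, struct.pack(">H", byte_count))
```

For example, pack_open(3, (10, 0, 0, 5), 23, 1024) fills the whole
data area, since the OPEN CONNECTION fields add up to exactly 10
bytes, while pack_send(3, 80) uses only the first two data bytes.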