Robust Header Compression                               Peter J. McCann
INTERNET DRAFT                                               Tom Hiller
Document: draft-mccann-rohc-gehcoarch-02.txt        Lucent Technologies
                                                             June, 2001

   Requirements and Architecture for Header Stripping and Generation

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [Bradner96].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

1. Abstract

Efficient transmission of voice over wireless links requires significant engineering effort. Because of the high cost of bandwidth on such links, special techniques for compression of voice data and its transmission over the air have been developed. The compression techniques and the wireless physical layers have been co-designed for maximum spectral efficiency and human perceptual euphony. Voice over IP (VOIP) applications should be able to leverage this engineering effort when used over wireless links. We advocate a "header stripping and generation" approach to this problem in order to enable the end-to-end service model while achieving maximum spectral efficiency and simplicity of implementation. This document outlines an architectural framework for a wireless VOIP application, including the wireless link layer and its interface to typical IP stack implementations, and discusses the protocol elements that should be standardized between the various components.
McCann, Hiller                  Expires 08/2001                       1

GEHCOARCH                                                February, 2001

2. Introduction

Voice over IP (VOIP) promises to change radically the way that telephony services are built and delivered. Integration of voice with the Internet will not just be a change in the way traffic is carried; rather, new types of services will be made possible by the integration of voice with existing Internet applications such as the World Wide Web and e-mail. The key to these new services will be a platform that offers open programmability while offering a transport for VOIP in an integrated, robust, and efficient way.

Wireless links offer great challenges to the transport of voice traffic, and significant engineering effort has gone into making them efficient for circuit voice applications. New voice compression algorithms ("codecs"), such as EVRC [TIA-IS127], SMV [TIA-SMV], or AMR [ETSI-AMR] have been developed to minimize the amount of data that must be carried, and special over-the-air channels have been implemented to carry these codecs with a minimum of overhead bits and minimal latency.

VOIP flows will be carried inside the Real-time Transport Protocol (RTP) [Shulzrinne96] on wired links. However, for wireless links, the situation is less clear. The limited bandwidth of wireless links makes it impossible to transmit the entire IP/UDP/RTP header with every packet, as the overhead would be prohibitive. It is possible to compress these headers by transmitting only updates to the fields that change rather than the entire header [Bormann01], but these compression schemes are complex and can never entirely eliminate the overhead due to RTP. Even when the header is compressed down to one byte per frame on average, the impact on spectral capacity is significant. Also, the variable-sized frames produced by these compression protocols are unsuitable for typical wireless links that support only a limited number of frame sizes.
The fundamental reason why these schemes cannot achieve the same efficiency as circuit data is that they discard information that is available at the physical channel layer, including the real-time nature of the traffic, which can assist in reconstructing the RTP header. This document describes an architectural framework that allows such real-time information to be used while not restricting the choice of call control protocol, placement of call feature servers, or mobile station architecture.

All work to date on header compression has taken as a basic requirement that it will operate with no knowledge about the applications generating the compressible packets, and therefore when a packet is compressed and then decompressed, the result must be bit-for-bit identical with the original packet. However, we argue that many applications, especially those that are only concerned with transmission and playback of voice, can tolerate some amount of skew in the reproduced RTP headers. When a compressor/decompressor pair can make these assumptions, very simple and efficient header compression can be performed. Our architecture allows applications to indicate their ability to tolerate such skew, and we discuss the conditions under which applications may do so. This allows us to implement a form of header compression that makes use of existing circuit voice implementations with minimal changes; we refer to this approach as "header stripping and generation."

3. Wireless Technology Considerations

Cellular wireless technologies will support distinct bearer channels for real-time audio flows versus non-real-time data. Data for TCP, such as web or e-mail traffic, will suffer from the lossy nature of the wireless link unless a link-layer retransmission protocol is used to improve its reliability.
Such a retransmission protocol (called the Radio Link Protocol or RLP in the emerging cellular data networks) does improve reliability but only at the expense of additional buffering and latency [Fairhurst01]. Real-time audio streams cannot tolerate the additional latency, which could be on the order of 1 second under adverse radio conditions. For this reason, a separate bearer channel will be used for voice that does not perform retransmission. This bearer will be very similar to the existing circuit voice channels. The architecture outlined below allows the mobile station to make effective use of this channel for VOIP. In addition to the two types of channels outlined above, there are likely to be intermediate kinds of channels intended to carry various kinds of IP multimedia data. This data is somewhat more sensitive to delay and less sensitive to loss than ordinary packet data, but is unable to use the underlying link framing in the same manner as circuit voice. Such traffic will need to be carried in, e.g., HDLC, and will be transported over an RLP that does few or no retransmissions to avoid introduction of buffering and delay. Because of these multiple channel types, each endpoint of the link will need to know when to establish new channels and how to properly allocate packet flows to channels. Applying heuristics to guess which flows should be allocated to which channels will not be acceptable; in this case, unlike the heuristics used for bit-wise transparent header compression, guessing wrong will do harm to application flows because they will not be able to meet their real-time requirements. The architecture of a mobile station should allow maximum flexibility in its hardware and software choices. Two basic mobile station models have been identified in the wireless data community. A "network model" station is one that is completely integrated, such as a phone plus browser or a palmtop with integrated radio hardware. 
Such a device usually has a real-time operating system, a DSP chip for processing the audio codec, and an embedded IP stack implementation. In contrast, a "relay model" station is one that is split in two: it consists of a piece of terminal equipment (such as a laptop computer) connected to a piece of radio equipment, usually by a serial connection. The idea is to make use of the mostly stock operating system on the terminal equipment, while "relaying" the data to and from the wireless network via the radio equipment.

We take the point of view that VOIP applications must be supported for both kinds of mobile stations. While network model phones will offer a tightly integrated set of services, relay model stations are likely to offer a much more open and programmable environment on the terminal equipment. As these devices evolve we expect the distinction between network and relay models will blur as the wireless device moves closer to the UNIX notion of a "network interface" to a stock operating system, and the operating system evolves to take on more real-time functionality.

Both the relay and network model terminals are endpoints that both host applications and terminate the complex wireless link described above. In other words, the compressor/decompressor are always operating over the last hop link. This makes it simple for applications to express their preferences to the compressor for how their packets should be treated; an application can use a local software API for communicating with the local compressor, and link-layer signaling can communicate these preferences to the remote compressor one hop away. While we expect this will cover the vast majority of cases, there may also be a requirement to support a router with this type of wireless link interface, such as a phone that is acting as an IP-layer gateway for many IP devices carried by a single user.
If voice flows are expected to originate from such devices, then new signaling protocols must be used to indicate to the (now remote) compressor how packets should be treated. Note that this not only includes the application's preference for transparency, but also which kind of underlying wireless channel should be used to carry the traffic.

Previous header compression schemes have relied on heuristics to recognize which flows are RTP traffic; because they were bit-for-bit transparent, they claimed that choosing incorrectly did no harm to applications. Because we relax this bit-for-bit transparency requirement, we must be sure that a flow belongs to an application that can tolerate skew. Otherwise, the skew could do harm to applications. However, note that sending a packet over an incorrect wireless channel would also do harm, because the real-time performance needs of the application will not be met. We presume that some form of IP-layer signaling must be used to inform routers how to allocate flows to channels, and we propose that application transparency requirements can be carried at the same time.

4. Requirements

In this section we examine the environment in which zero-byte header compression is expected to operate, including the required efficiency, assumptions about applications, and concerns about simplicity.

4.1 Efficiency

Approximate voice activity factors (probability distribution of frame sizes) for the Selectable Mode Vocoder (SMV) are given in Figure 1. These reflect one party's activity during a typical two-way interactive voice call.

   Rate       Activity %    Payload (bits)
   Full           20             171
   Half           20              80
   Quarter        10              40
   Eighth         50              16

   Figure 1: Activity of the 3GPP2 Selectable Mode Vocoder

This vocoder is designed to operate synchronously with the underlying physical channel: it outputs one of the above frame sizes every 20 milliseconds.
Which frame size is output depends on the characteristics of the speech being compressed; typically, full-rate (171 bit) frames are used during active talk spurts, interspersed with half- and quarter-rate frames as needed. Eighth-rate frames are used mainly during silence periods, but they also contain information about the noise components present in the silence, which is referred to as "comfort noise generation". Also, the physical link typically requires that some frame be transmitted during every 20ms interval so that power control can be maintained, and the eighth-rate frames play this role. The cdma2000 air interface has been designed with these frame sizes in mind, to support optimal transport of circuit voice. It is not possible to perform a marginal adjustment to the frame sizes to accommodate header overhead. This makes application of the basic ROHC RTP profile problematic at best: if one byte of LSB-encoded sequence number is added to a frame, it must be carried in the next-higher frame format. For a full-rate frame, there is no next-higher frame format and so those frames could not be transported without breaking the synchronization with the underlying physical link and introducing additional framing, for example with the use of PPP HDLC flags or the ROHC segmentation mechanism. This would introduce another 1 or 2 bytes of overhead per frame, and would also have a multiplier effect on the frame error rate since most vocoder frames would now span two physical frames. Finally, this lack of synchronization would introduce an occasional lag between the vocoded frame time and real time that could add to the end-to-end latency and jitter of the RTP flow. Even a very conservative calculation, assuming these problems can be overcome and ignoring the contribution from eighth-rate 16 bit frames, yields an additional 400 bits per second from the header and segmentation overheads. 
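The arithmetic behind this estimate can be sketched as follows. This is an illustration only, not part of the draft's analysis. The activity factors and payload sizes are taken from Figure 1. Two inputs are assumptions of the sketch: a per-frame overhead of 16 bits (one header byte plus one framing byte) on every frame that is not eighth-rate, and nominal channel rates of 9600, 4800, 2400, and 1200 bps for the four frame types. Under these assumptions the sketch happens to reproduce the 400 bps and 3720 bps figures quoted in this section; the authors' own derivation may differ.

```python
FRAMES_PER_SECOND = 50  # one vocoder frame every 20 ms

# rate -> (activity in percent, payload bits per frame, assumed channel bps)
# Activity and payload come from Figure 1; the channel rates are an
# assumption of this sketch, not values stated in the document.
SMV_RATES = {
    "full":    (20, 171, 9600),
    "half":    (20,  80, 4800),
    "quarter": (10,  40, 2400),
    "eighth":  (50,  16, 1200),
}


def average_payload_bps(rates=SMV_RATES):
    """Average vocoder payload rate implied by Figure 1."""
    bits = sum(pct * payload for pct, payload, _ in rates.values())
    return bits * FRAMES_PER_SECOND / 100


def average_channel_bps(rates=SMV_RATES):
    """Average channel rate, weighting the assumed per-rate channel speeds."""
    return sum(pct * chan for pct, _, chan in rates.values()) / 100


def overhead_bps(overhead_bits_per_frame=16, ignored=("eighth",), rates=SMV_RATES):
    """Added bits per second when each non-ignored frame carries extra bits."""
    pct = sum(p for name, (p, _, _) in rates.items() if name not in ignored)
    return pct * FRAMES_PER_SECOND * overhead_bits_per_frame / 100


if __name__ == "__main__":
    extra = overhead_bps()
    channel = average_channel_bps()
    print("payload bps:", average_payload_bps())
    print("channel bps:", channel)
    print("overhead bps:", extra, "fraction:", extra / channel)
```

With these inputs the overhead is 400 bps against an average channel rate of 3720 bps, or roughly 10.8%.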
Compared to the average 3720 bps circuit voice rate, this overhead (greater than 10%) would significantly diminish the number of calls that can be handled in a given amount of spectrum. We conclude that because the codec and physical link have been co-engineered to such tight tolerances, we should endeavor to use the vocoder/physical link largely unchanged from its existing implementation for circuit voice.

4.2 Application Assumptions

In order for real-time to serve as a proxy for the RTP sequence number, it must be the case that the sequence number increments by one for every physical layer epoch. This would be true if the transmitter sends a vocoded frame for every epoch, as is done by the existing cdma2000 vocoders even during silence intervals. Note that in 3G systems the mobile node transmits continuously even during silence so that the network may monitor power. Note also that these frames are not empty; they do carry information about the background noise components during silence, known as "comfort noise".

We explicitly relax the assumption that reconstructed headers at the decompressor are bit-for-bit identical to the headers seen by the compressor. Specifically, we note that for most VOIP applications, the RTP sequence number and timestamp are primarily used to schedule frames for playback over a relatively short interval. Implementations typically maintain a playback buffer of a few frames, and place incoming voice samples into that buffer based on their timestamp and sequence number. Based on a running average of the buffer depth, frames are discarded or silence is inserted according to whether the buffer is too full or is running low, respectively. Such a playback buffer only needs the timestamps and sequence numbers to be relatively accurate; that is, over short timescales, neighboring frames should have neighboring timestamps and sequence numbers.
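The point about relative accuracy can be illustrated with a deliberately simplified placement routine. It is a hypothetical sketch, not a description of any particular playback buffer: it places each frame relative to the first frame received, and the value of 160 timestamp units per frame (20 ms at an assumed 8000 Hz RTP clock) is an assumption. The function names are invented for this example.

```python
SEQ_MASK = 0xFFFF        # RTP sequence numbers are 16 bits wide
TS_MASK = 0xFFFFFFFF     # RTP timestamps are 32 bits wide


def playout_slots(packets, samples_per_frame=160):
    """Map (seq, timestamp) pairs to slots relative to the first packet.

    Only differences between neighboring values are used, so the absolute
    values of the sequence number and timestamp do not matter.
    """
    base_seq, base_ts = packets[0]
    slots = []
    for seq, ts in packets:
        seq_delta = (seq - base_seq) & SEQ_MASK
        ts_delta = (ts - base_ts) & TS_MASK
        slots.append((seq_delta, ts_delta // samples_per_frame))
    return slots


def apply_fixed_skew(packets, seq_offset, ts_offset):
    """Shift every sequence number and timestamp by a constant offset."""
    return [((s + seq_offset) & SEQ_MASK, (t + ts_offset) & TS_MASK)
            for s, t in packets]


if __name__ == "__main__":
    # Four frames with one loss (sequence number 1 missing), wrapping at 65535.
    original = [(65534, 1000), (65535, 1160), (0, 1320), (2, 1640)]
    skewed = apply_fixed_skew(original, seq_offset=7, ts_offset=12345)
    print(playout_slots(original))
    print(playout_slots(skewed))
```

Both calls yield the same slot assignment, which is why a fixed offset between compressor and decompressor is invisible to a buffer of this kind. Real implementations also adapt their depth over time, which this sketch omits.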
Any small, fixed skew that is introduced into the packet stream will be quickly corrected by the playback buffer mechanism.

However, not every application will be able to tolerate such skew. Defining "non-transparent compression" as any compression that changes bits end-to-end, we could make the following statements:

1. The end-system MUST be aware of any non-transparent compression.

2. The end-system MUST be able to turn off non-transparent compression if it chooses.

Depending on the application semantics for each header field, we can classify that field as follows:

BRITTLE   These fields are those that must be reconstructed by the decompressor so that they match bit-for-bit those seen at the compressor.

PLIANT    These fields are those for which the application can tolerate some form of skew.

Note that BRITTLE fields can be either STATIC or CHANGING, where those terms are used as in appendix A of ROHC [Bormann01]. For simplicity, we assume that all PLIANT fields are CHANGING. We note that STATIC fields are easy to communicate precisely at initialization time, so classifying such a field as PLIANT would not ease the compression/decompression task. However, we note that for the specific application of wireless vocoders, we can often make stronger assumptions about what fields are static. For example, the Marker Bit may never be used for EVRC in wireless applications [Li01]. This lets us assume this field is STATIC.

The PLIANT fields can be further classified according to what kinds of skew can be tolerated by applications. For example, a RARELY-CHANGING (RC) [Bormann01] field is updated infrequently and thereafter keeps its new value. We note that for most RC fields, it is not mandatory that such a change reach the receiver in the exact packet in which it was made by the sender.
For example, a CSRC list can be updated in a somewhat asynchronous manner; if the update is applied a few packets earlier or later application semantics will not be affected. Similarly, a SEMISTATIC [Bormann01] field such as one used for congestion notification does not need to be precisely synchronized with the original packet in which it was set; as long as the receiver gets the congestion notice in a reasonable amount of time it can take appropriate action. We refer to such fields as RC-PLIANT or SS-PLIANT. For fields like the RTP Timestamp and Sequence number, we introduce a new term: OFFSET-PLIANT These fields may be changed by some offset in the compression/decompression process. These fields have a STATIC delta and are incremented by that delta with each packet. The precise offset of the decompressor from the compressor is itself a RARELY-CHANGING value. Note that some applications may impose semantics on fields that make them BRITTLE. For example, if SRTP is in use [Blom01] the sequence number and/or timestamp must be matched precisely to the encrypted, vocoded frame. Also, if RTCP is being used to estimate round-trip time, these estimates will be perturbed by the offset amount. Applications may be able to tolerate different amounts of offset and it may be important in the future to characterize the amount of offset introduced by a particular implementation; however, for now we take a purely qualitative approach. Table 1 lists the CHANGING fields from the basic ROHC RTP profile [Bormann01]. For each, we give the application assumptions on pliancy that must hold for a header stripping/generation approach to preserve IP and application semantics. 
   +------------------------+-------------+--------------------------+
   | Field                  | Assumption  | Note                     |
   +========================+=============+==========================+
   | IPv4 Id: Sequential    | PLIANT      | Start w/initial context  |
   +------------------------+-------------+--------------------------+
   | IP TOS / Tr. Class     | RC-PLIANT   | Probably never updated   |
   +------------------------+-------------+--------------------------+
   | IP TTL / Hop Limit     | RC-PLIANT   | Unimportant for last-hop |
   +------------------------+-------------+--------------------------+
   | UDP Checksum: Disabled | STATIC      | Checksum always disabled |
   +------------------------+-------------+--------------------------+
   |                 No mix | STATIC      |                          |
   | RTP CSRC Count: -------+-------------+--------------------------+
   |                 Mixed  | RC-PLIANT   | Update need not be sync. |
   +------------------------+-------------+--------------------------+
   | RTP Marker             | STATIC      | Disable for EVRC         |
   +------------------------+-------------+--------------------------+
   | RTP Payload Type       | STATIC      |                          |
   +------------------------+-------------+--------------------------+
   | RTP Sequence Number    |OFFSET-PLIANT|                          |
   +------------------------+-------------+--------------------------+
   | RTP Timestamp          |OFFSET-PLIANT|                          |
   +------------------------+-------------+--------------------------+
   |                 No mix | -           |                          |
   | RTP CSRC List:  -------+-------------+--------------------------+
   |                 Mixed  | RC-PLIANT   | Update need not be sync. |
   +------------------------+-------------+--------------------------+

   Table 1: Assumptions on the CHANGING header fields necessary for header stripping and generation.

We re-classify the IPv4 Identification field as PLIANT. We assume that IPv4 Identifiers can be generated at the decompressor by incrementing from an initial value supplied by the compressor. RTP packets should not be fragmented, and the risk of an IPv4 Identifier collision with another fragmented packet should be negligible.
If not, then we assume at least that Identifiers are taken from a contiguous range and do not need to be encoded with every packet. Only when a new range of Identifiers is chosen would an update need to be sent. Note that it is not important for such Identifiers to be identical to the ones visible at the compressor, only that there be no collisions with other, fragmented packets.

We re-classify the Traffic Class/TOS field as RC-PLIANT. We note that for a given flow, it will probably never be updated unless Explicit Congestion Notification is in use. ECN bits could be treated as RC-PLIANT or SS-PLIANT, based on future study. It is not clear what the benefit of ECN will be for low-bitrate flows such as EVRC; such a codec will probably not respond to congestion notification.

We re-classify the TTL/Hop Limit field as RC-PLIANT. Note that for last-hop links, this field will be constant in the uplink direction and its value will be unimportant for the downlink direction, because IP forwarding will not be performed. This field only needs to be updated if the header stripping/generation is operating over a non-last hop link where there is a potential for routing loops. Even if there is the potential for routing loops, it is not necessary to update the TTL in a precisely synchronized way; a strategy of eager decrease/lazy increase, for example, would have the desired effect of stopping routing loops while not introducing too much update overhead.

We re-classify the RTP Marker bit as STATIC for the applications of interest. The purpose of the Marker bit is to indicate where silence may be inserted or removed in case of playback buffer overflow or underflow. We note that wireless codecs will typically have their own methods of detecting silence, such as the use of low-rate frames.

We re-classify the CSRC count and values as RC-PLIANT.
We assume that CSRCs are updated rarely, if at all, and so these updates can be carried over the sister reliable data link to the peer without imposing much additional overhead. There is no need to synchronize them precisely with the packet in which they first appear.

Under the above assumptions, all BRITTLE fields are STATIC. This will allow header stripping/generation to work without adversely impacting end-to-end semantics.

4.3 Simplicity

A major goal of this document is to support transport of voice over existing cellular voice channels with little or no changes on the supporting radio access equipment. Allowing a solution to completely strip out the header, transmitting only voice data on this channel, will significantly aid that goal. By not imposing any new format requirements on the vocoded frames, we allow development of future codecs to proceed with maximum flexibility.

The simplicity of the supporting header compression state machine must also be considered. Wireless devices are likely to be limited in both power and memory budgets. Network access servers, while they will be implemented on larger footprint equipment, will need to support large numbers of attached devices and so scalability is a key issue. By decoupling the header initialization and updates from the synchronous voice traffic channel, it may be possible to achieve significant simplifications in the header compression protocol state machine.

5. Reference Architecture

Our reference architecture is shown in Figure 2.

   Remote VOIP   Other NRT    VOIP        Zero-Byte
   Application     Apps      Control------Control
            \        \        /              |
             \        \      /               |
              +-------------IP Protocol      /
                              Stack         /
                                |          /
                                |         /
   Header Comp/ ------------Data Link-----------+          Peer
   Decomp      \              Layer                        System
                \___            |               |
                    \           |               |
   Audio     Codec   \          +-------------->Physical<----+
   Hardware<--->Impl <--+------------------>Channel(s)<--+

   Figure 2. Reference architecture for a system implementing zero-byte header compression.
The architecture diagram consists of nine components connected to a peer system by a collection of physical channels. Note that we expect zero-byte header compression to be somewhat asymmetric in that it will usually be implemented between a mobile station, where the VOIP and other applications reside, and a peer network entity that is just a data link termination point and a first-hop Internet router. As such, the peer system in the network will likely be missing the audio hardware and codec implementation, and may not participate in the VOIP control. Also, the mobile station may not need to actually perform header compression and decompression if its codec implementation is connected directly to the physical channel, which may be required to achieve the desired latency guarantees. The component named "Zero-Byte Control" would consist of the protocol logic used to set up and maintain the zero-byte header compression context.

In the following subsections we discuss each of the architectural elements in turn. The next section will discuss the interfaces between them.

5.1 Non Real-time Components

It is important to distinguish between the real-time and non real-time components of Figure 2. This is especially important for a relay model mobile station, as it impacts which elements of stock operating systems can be reused and which must be implemented as new real-time extensions. In this subsection we examine the non real-time components.

5.1.1 VOIP Control

The VOIP control component is the implementation of the call signaling protocol, such as SIP [Handley00] or H.323 [ITU-H323]. We make no assumptions on which protocol is used, and we do not require the network-side peer system to contain this element. The mobile station will use one of the VOIP signaling protocols to interact with call feature servers that could be anywhere on the Internet.
We assume that this component will open network-layer connections and will have access to the transport endpoint identifiers for the IP/UDP/RTP flow. However, we do not require this element to actually process audio data; it will probably be implemented in user-space and could add unpredictable latency to such flows, depending on operating system characteristics.

5.1.2 Remote VOIP Application

If the RTP generating application is remote from the physical link, i.e., there is at least one IP hop separating it from the compressor, then it will not have direct access to the zero-byte control component. Some network layer protocol must be introduced if it is to take advantage of zero-byte header compression.

5.1.3 IP Protocol Stack Implementation

We assume that the mobile station implements an IP protocol stack in conformance with RFC 1122 [Braden89]. Note that such an implementation may not be capable of supporting hard real-time tasks.

5.1.4 Data Link Layer

The data link layer is the interface between the IP protocol stack and the wireless network device. For cdma2000, this will be PPP [TIA-IS835]. For GPRS, this will be LLC [ETSI-LLC], and for UMTS, this will be PDCP [ETSI-PDCP]. For cdma2000, we assume a mostly stock PPP implementation for interaction with the physical channels that support data and perform retransmission. However, because the data link layer may not be a hard real-time component, we would not require it to be on the audio traffic path inside the mobile station.

5.1.5 Zero-Byte Control

The Zero-Byte Control component is responsible for negotiating the use of header stripping/generation with the peer system and for setting up context information such as the fixed portion of the IP/UDP/RTP header. It will interact with the VOIP control component to acquire these parameters, and will send them across the data link layer to the peer system.
It will also interact with the wireless device (possibly through the data link layer) to establish the physical audio channels and will identify the channel to be used when sending context information to the peer system.

5.1.6 Other Non-Real-Time Applications

We expect the terminal equipment to be a general-purpose computer that, as such, will have other applications running. These applications may interact with other components such as the IP protocol stack, but in general will not be hard real-time tasks. These applications must coexist with all the other components.

5.2 Real-time Components

Because we make use of the real-time nature of the physical channel, several components must be implemented as real-time tasks. For a network model phone, this is similar to existing practice: a tightly integrated, real-time operating system on an embedded device schedules the audio sampling and playback to coincide with the physical frame rate of the wireless link. For a relay model terminal, we wish to make use of the audio hardware on the connected terminal equipment. This may require that the components be implemented using special real-time extensions to existing stock operating systems.

5.2.1 Audio Hardware

The audio hardware consists of the analog-to-digital (A/D) and digital-to-analog (D/A) converters used for sampling and playing back sound, along with the analog microphones and speakers. In a network model phone this consists of the integrated equipment that is part of the phone. In a relay model terminal it would be the "sound card" or other audio peripheral.

5.2.2 Codec Implementation

The codec implementation converts the sampled audio to and from the special wireless-specific encoding format. For a network model phone, this encoding is carried out on dedicated Digital Signal Processing (DSP) hardware.
In a relay model terminal, we assume this is performed on the general purpose CPU of the terminal equipment. 5.2.3 Physical Channel As mentioned before, there will be at least two physical channels supporting the mobile station: one that runs RLP retransmission, supporting the latency tolerant data applications; and another that resembles a voice circuit. VOIP control signaling will traverse the data-oriented RLP channel, while the voice bearer traffic will traverse the real-time circuit-like channel. Both channels must be available to the upper layers regardless of whether a relay model or network model terminal is used. The voice channel supports real-time traffic and performs no buffering. It will send a frame at precise, periodic intervals, such as 20 milliseconds for cdma2000. The codec implementation must be able to supply frames for the physical channel at exactly this rate. 5.2.4 Header Compression/Decompression The codec implementation may be directly connected to the physical channel on the mobile terminal side, and so concrete IP/UDP/RTP headers may not necessarily appear inside the mobile terminal. However, we do not prohibit a mobile terminal from reconstructing such headers if it requires them. This component is drawn next to the data link layer in the diagram, and may in fact be integrated into the data link layer implementation. It is responsible for classifying each packet coming down from the IP protocol stack against the fixed IP/UDP/RTP header fields we are attempting to compress. The value of these fields is established by the Zero-Byte Control component and installed into the header compression component, possibly via the data link layer. Once the header has been stripped this component must schedule the payload for transmission on the physical layer at the appropriate frame interval, according to the sequence number and timestamp received in the header. 
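The classification and stripping step on the sending side can be sketched as follows. This is a minimal illustration under stated assumptions, not a specification of the component: it handles only IPv4 without options, UDP, and a 12-byte RTP header with no CSRC entries, padding, or extension, and it requires the Marker bit to be clear, consistent with treating that field as STATIC. The `FlowContext` type, its field names, and `strip_header` are hypothetical names invented for this example; the mapping from sequence number to frame slot is likewise an assumption.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowContext:
    """Hypothetical static context installed by Zero-Byte Control."""
    src_ip: bytes        # 4-byte IPv4 source address
    dst_ip: bytes        # 4-byte IPv4 destination address
    src_port: int
    dst_port: int
    ssrc: int
    payload_type: int
    first_seq: int       # sequence number assigned to frame slot 0


def strip_header(packet: bytes, ctx: FlowContext):
    """Return (frame_slot, payload) if packet matches ctx, otherwise None."""
    # 20 bytes IPv4 + 8 bytes UDP + 12 bytes RTP; version 4, IHL 5, protocol 17.
    if len(packet) < 40 or packet[0] != 0x45 or packet[9] != 17:
        return None
    if packet[12:16] != ctx.src_ip or packet[16:20] != ctx.dst_ip:
        return None
    src_port, dst_port = struct.unpack_from("!HH", packet, 20)
    if (src_port, dst_port) != (ctx.src_port, ctx.dst_port):
        return None
    first, second, seq, _timestamp, ssrc = struct.unpack_from("!BBHII", packet, 28)
    # 0x80: RTP version 2, no padding, no extension, CSRC count of zero.
    # Comparing the whole second byte also requires the Marker bit to be 0.
    if first != 0x80 or second != ctx.payload_type or ssrc != ctx.ssrc:
        return None
    frame_slot = (seq - ctx.first_seq) & 0xFFFF
    return frame_slot, packet[40:]
```

A packet that fails any check would be returned to the ordinary data path rather than sent on the voice channel; the returned slot number tells the scheduler which 20 ms interval the payload belongs to.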
In the opposite direction, when packets arrive on the network side from the physical channel, this component is responsible for regenerating the proper IP/UDP/RTP header and passing the packet on to the IP protocol stack. It makes use of the physical arrival time to generate the proper timestamp and sequence number in the RTP header.

Because the header compression/decompression component sends packets to and receives packets from the IP protocol stack, it may be implemented as a soft real-time component. However, it must interact with the physical voice channel, which is a hard real-time component, both to properly record the frame arrival time and to schedule outgoing packets for transmission. If the header compression/decompression is implemented in a separate network element from the physical channel, as is likely to be the case in the emerging cellular architectures [TIA-IS835], then this interaction could be accomplished with the proper use of sequence numbers on the interfaces between them so that each physical frame carries the information about when it arrived or when it is to be transmitted.

6. Interfaces

In this section we examine the interfaces between the above components. We distinguish between those interfaces that should be implemented as protocols, suitable for standardization in the IETF or elsewhere, and those that should remain Application Programming Interfaces (APIs) that may or may not need to be standardized.

6.1 Protocol Reference Points

In terms of new protocols, the interfaces that need to be standardized are listed below. Some of these interfaces are opportunities for IETF protocols, while others should be carried out by other standards-setting organizations.

6.1.1 Zero-Byte Control to Data Link Layer

The Zero-Byte Control component needs to negotiate the use of header stripping/generation with its peer and convey the static portion of the IP/UDP/RTP header to the peer.
This should be done in such a way that the network side is not required to participate in the VOIP control protocol. This means the network side depends on the mobile station to inform it which RTP flows should be classified by the header compression component as appropriate for sending over the physical voice channel. Rather than create a new network-layer protocol, we advocate using new data link messages between the two systems to convey this information.

6.1.2 Data Link Layer to Physical Channel

Mobile terminals running PPP will typically generate an octet stream that is appropriate for an underlying physical channel running RLP. However, prior to running PPP the mobile terminal must take steps to establish the channel. Also, we require that the terminal be able to dynamically establish and release the voice channels used for real-time audio. For a network model phone this may be supported by APIs within the phone, but for a relay model terminal this signaling needs to be carried out across a serial port. Such signaling is usually the province of a modem control protocol ("AT commands") and standardization is probably best carried out in the International Telecommunication Union (ITU).

Note that in addition to the usual signaling to establish and release channels, we also need to obtain identifying information for each channel. This information will be used by the Zero-Byte Control component to communicate the initial timestamp and sequence number offsets to the peer. It must be possible to signal this information during a running PPP session.

Additional real-time information from the physical channel may improve the header compression. For example, if the precise activation time of the channel is known and can be correlated with the RTP packet flow, the compressor could initialize and communicate the precise RARELY-CHANGING offset to the decompressor.
Precise information about other events that affect the offset, such as handoffs, buffer over/underflow, or clock drift between the physical channel and internal RTP timestamp, would also be useful. If properly engineered, this would allow even OFFSET-PLIANT fields to be accurate most of the time, which could allow applications like SRTP to function adequately.

6.1.3 Physical Channel to Codec or Header Compression/Decompression

As stated above, the physical channel could interface directly to the codec implementation on the mobile station side and to a header compression/decompression process on the network side. For a network model phone, the codec interface may be a proprietary API. However, for a relay model terminal, we must standardize a new way to transport the frames across a serial connection in real time. This will require that we multiplex the real-time frames with the non-real-time data for PPP. This multiplexing could be carried out with the use of escape characters on the serial interface; again, this work is probably best carried out within the ITU. Any new special characters would need to be properly inserted into the ACCM of the PPP implementation.

On the network side, the physical voice channel may be separated from the header compression/decompression process by an IP network. If this is the case, then each physical frame must carry a sequence number that indicates the exact frame time that it was received or is to be transmitted over the air. Standardization of such interfaces is best carried out within the 3rd Generation Partnership Projects (3GPPs).

6.1.4 Remote VOIP Application to Zero-Byte Control

If the RTP generating application is remote from the wireless link, i.e., there is at least one IP hop separating it from the compressor, then it will not have direct access to the Zero-Byte Control component. Some network-layer protocol must be introduced if it is to take advantage of zero-byte header compression.
This protocol could be similar to the hints that have been introduced into RSVP [Davie00], although we note that the flow specifications in RSVP are not likely to be flexible enough to specify packet flows that contain layers of encapsulation.

6.2 API Reference Points

Other interfaces between the components are best done as Application Programming Interfaces (APIs) and may or may not need to be standardized. In any case we do not advocate the standardization of APIs within the IETF and we discuss these interfaces for illustration only.

6.2.1 VOIP Control to Zero-Byte Control

The VOIP control component is responsible for end-to-end VOIP signaling such as SIP [Handley00] or H.323 [ITU-H323]. We expect these applications to be implemented by many different people and to use standard operating system interfaces. Also, these applications should work the same way when used in wireless or wireline settings, except that the codecs should be tailored for the specific link layer currently in use.

When used over wireless links, applications may want to make use of the optimized real-time path outlined above (audio hardware to codec to physical channel) rather than taking audio data into user space, performing a user-space codec transformation, constructing RTP packets, and writing them to a standard UDP socket. Such user-space manipulation of audio traffic could introduce unpredictable latency to the flow, depending on the operating system characteristics.

To enable the optimized real-time path, the VOIP control protocol should signal to the Zero-Byte Control component that it has completed VOIP signaling and is ready to begin audio bearer flow. This signal might be a system call containing the IP/UDP/RTP parameters that have been negotiated and the codec to be used. This system call would be a one-line addition to existing VOIP client implementations.
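As an illustration only, the signal from the VOIP client might look like the following sketch. No such interface is defined by this document: the NegotiatedFlow and ZeroByteControl names, the start_flow method, and all field names and values are hypothetical, and a real Zero-Byte Control component would be reached through an operating system interface rather than an in-process object.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NegotiatedFlow:
    """Parameters agreed during VOIP signaling (field names are illustrative)."""
    local_addr: str
    remote_addr: str
    local_port: int
    remote_port: int
    ssrc: int
    payload_type: int
    codec: str


@dataclass
class ZeroByteControl:
    """Stand-in for the Zero-Byte Control component (hypothetical interface)."""
    active_flows: list = field(default_factory=list)

    def start_flow(self, flow: NegotiatedFlow) -> int:
        """Record the flow and return a handle the client can refer to later."""
        self.active_flows.append(flow)
        return len(self.active_flows) - 1


zbc = ZeroByteControl()
flow = NegotiatedFlow("192.0.2.10", "198.51.100.7", 49170, 49172,
                      0x1234ABCD, 97, "EVRC")
handle = zbc.start_flow(flow)   # the single call added to the VOIP client
```

The point of the sketch is that everything the client passes is already known to it once signaling completes, so the addition to an existing client can stay small.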
6.2.2 Zero-Byte Control to Real-time Path

When the Zero-Byte Control component receives a signal from the VOIP control component that the VOIP signaling has been completed, it must take the following steps:

   1) Open the new physical voice bearer channel;

   2) Send the peer system information about the flow, including the static header fields and identification of the physical bearer channel; and, finally,

   3) Trigger the audio hardware to begin sampling, and the codec implementation to begin encoding/decoding.

The first step could be accomplished via an interface to the data link layer, or may be accomplished directly. In the second step, the existing PPP connection is used to inform the peer which header fields should be attached to the synchronous voice frames, beginning with any convenient nearby starting point, such as the first frame received. The third step requires interaction with the real-time components such as the audio hardware and codec implementation, to enable the real-time data to start flowing.

If the additional real-time channel information is available concerning establishment time, handoff, clock drift, and buffer over/underflow, then additional features could be implemented to improve the transparency of the scheme. Whenever an event takes place that requires re-synchronization of the compression state, such as a physical layer reset (hard handoff) or sequence number slippage due to clock drift, the Zero-Byte Control component would update its peer with the appropriate state. This update should include an offset, calculated from the time the channel was established or reset, indicating to which physical layer frame the update applies. Such offset-indicating updates should also be sent when any of the normally static header fields, such as the TTL, TOS, or CSRCs, change. This will enable completely transparent decompression of RTP header fields for most packets.
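One way a decompressor might apply such offset-indicating updates is sketched below. This is an assumption-laden illustration rather than a specified algorithm: the Update and Decompressor names are hypothetical, each offset is taken to be the value the field would have had at frame index 0, and 160 timestamp units per frame assumes an 8 kHz RTP clock with 20 millisecond frames.

```python
from dataclasses import dataclass

SEQ_MODULUS = 2 ** 16         # RTP sequence numbers are 16-bit
TS_MODULUS = 2 ** 32          # RTP timestamps are 32-bit
SAMPLES_PER_FRAME = 160       # assumption: 8 kHz RTP clock x 20 ms frame


@dataclass(frozen=True)
class Update:
    """Offsets that take effect at a given physical frame index."""
    frame_index: int          # counted from channel establishment or reset
    seq_offset: int           # sequence number extrapolated back to frame 0
    ts_offset: int            # RTP timestamp extrapolated back to frame 0


class Decompressor:
    """Regenerates RTP sequence numbers and timestamps from frame indices."""

    def __init__(self, seq_offset: int, ts_offset: int):
        self._updates = [Update(0, seq_offset, ts_offset)]

    def apply_update(self, update: Update) -> None:
        self._updates.append(update)
        self._updates.sort(key=lambda u: u.frame_index)

    def regenerate(self, frame_index: int) -> tuple:
        # Use the most recent update that applies at or before this frame.
        current = self._updates[0]
        for update in self._updates:
            if update.frame_index <= frame_index:
                current = update
        seq = (current.seq_offset + frame_index) % SEQ_MODULUS
        ts = (current.ts_offset + frame_index * SAMPLES_PER_FRAME) % TS_MODULUS
        return seq, ts
```

Because every update names the frame at which it takes effect, frames that arrived before the update are still reconstructed with the older offsets.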
6.2.3 Header Compression/Decompression to Data Link Layer

The header compression component must classify all traffic from the IP protocol stack according to whether it is part of the RTP flow that needs to be sent on the voice physical channel. Because it must examine each packet, it will probably be fairly tightly integrated with the data link layer.

The header decompression component produces IP packets from the physical voice frames and sends them up the IP protocol stack. Getting packets to the IP protocol stack may be implemented by passing the packets through the data link layer.

6.2.4 Other Interfaces

The mobile terminal potentially will be executing many simultaneous applications and we expect all of the standard interfaces (network sockets, GUI) to be present. Note that ordinary applications may want to use the audio hardware at the same time as a voice call is in progress. This could be disallowed, or a special "audio mixer" process could be introduced between the audio hardware and the codec implementation to allow such simultaneous access. For example, a system beep noise might be mixed into the telephone call in such a way that only the mobile terminal user would hear it.

Much ado has been made about the proper reconstruction of the IP Identification field for each RTP packet. We note that RTP payloads are required to stay within the path MTU [Handley99] and should never experience fragmentation. However, in order to avoid any possibility of Identification field collision with other packets that may be fragmented, a new interface could be implemented between the Zero-Byte Control component and the IP protocol stack to "reserve" a range of Identification values for use by the RTP flow.
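A reserved range of this kind could be consumed as in the following minimal sketch. The IdRange class and its next_id method are hypothetical names, the range is treated as inclusive at both ends, and the sketch assumes the simplest policy of incrementing by one per reconstructed header and wrapping within the range.

```python
class IdRange:
    """Hands out IP Identification values from a reserved, inclusive range."""

    def __init__(self, first: int, last: int):
        # The IP Identification field is 16 bits wide.
        if not (0 <= first <= last <= 0xFFFF):
            raise ValueError("range must lie within the 16-bit Identification space")
        self.first = first
        self.last = last
        self._next = first

    def next_id(self) -> int:
        """Return the next Identification value, wrapping inside the range."""
        value = self._next
        # Wrap to the start of the reserved range instead of leaving it.
        self._next = self.first if value == self.last else value + 1
        return value
```

The IP protocol stack would then avoid this range when assigning Identification values to other packets.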
If the header decompression component always increments the Identification field by one for each reconstructed header, and wraps around to the beginning when the range is about to overflow, then no additional work is necessary to ensure uniqueness of IP Identification fields.

7. Conclusions

This draft has presented an architecture for zero-byte header compression and its implications for both a mobile station and the supporting network. On the network side, with this architecture the peer in the network does not need to be aware of the VOIP control between the mobile and a SIP/H.323 server that could be anywhere in the network. When the header compression/decompression is performed in a network element that is physically separated from the physical channel (e.g., a PDSN from 3GPP2 [TIA-IS835]), the hard real-time requirements on this element can be alleviated through the proper use of sequence numbers on its interface to the radio channel elements.

On the mobile side, this draft provides high-level requirements for support of zero-byte header compression in the form of protocol interfaces and APIs. Both monolithic network model mobiles and relay model mobiles with laptops are discussed. Proper architecture of the mobile station allows the segregation of hard real-time processing from the non-real-time IP stack and applications.

Furthermore, convergence of wireline and wireless applications is a long-standing goal in the wireless industry. This architecture allows VOIP-based applications developed for wireline access to run on mobile end systems in the wireless environment (although with wireless-specific codecs). The impact on VOIP applications could be as little as one line of code in the VOIP client itself.

Finally, the draft has outlined protocol work items suitable for the IETF as well as external standards bodies, including the ITU and 3rd Generation Partnership Projects.
Any necessary APIs could be standardized by a collaboration between operating system vendors (open source or otherwise) and third party application developers, driven by wireless service providers.

8. References

[Bormann01] Bormann, C. (ed.), "RObust Header Compression (ROHC)," RFC 3095, March 2001.

[Braden89] Braden, R. (ed.), "Requirements for Internet Hosts -- Communication Layers," RFC 1122, October 1989.

[Bradner96] Bradner, S., "The Internet Standards Process, Revision 3," RFC 2026, October 1996.

[Davie00] Davie, B., Iturralde, C., Oran, D., Casner, S., and Wroclawski, J., "Integrated Services in the Presence of Compressible Flows," RFC 3006, November 2000.

[ETSI-AMR] European Telecommunications Standards Institute, "Adaptive Multi-Rate (AMR) Speech Transcoding," 3G TS 26.090, February 2000.

[ETSI-LLC] European Telecommunications Standards Institute, GSM 04.64.

[ETSI-PDCP] European Telecommunications Standards Institute, 3G TS 25.323.

[Fairhurst01] Fairhurst, G., and Wood, L., "Link ARQ Issues for IP Traffic," draft-ietf-pilc-link-arq-issues-01.txt, March 2001. Work In Progress.

[Handley99] Handley, M., and Perkins, C., "Guidelines for Writers of RTP Payload Format Specifications," RFC 2736, December 1999.

[Handley00] Handley, M., Schulzrinne, H., Schooler, E., and Rosenberg, J., "SIP: Session Initiation Protocol," draft-ietf-sip-rfc2543bis-01.txt, August 2000. Work In Progress.

[ITU-H323] International Telecommunication Union, "Packet Based Multimedia Communications Systems," ITU-T Rec. H.323, September 1999.

[Li01] Li, A. (ed.), "An RTP Payload Format for EVRC Speech," draft-ietf-avt-evrc-03.txt, May 2001. Work In Progress.

[Shulzrinne96] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V., "RTP: A Transport Protocol for Real-Time Applications," RFC 1889, January 1996.
[TIA-IS127] Telecommunications Industry Association, "Enhanced Variable Rate Codec, Speech Service 3 for Wideband Spread Spectrum Digital Systems," TIA/EIA/IS-127, February 1997.

[TIA-IS835] Telecommunications Industry Association, "Wireless IP Network Standard," TIA/EIA/IS-835, June 2000.

[TIA-SMV] Telecommunications Industry Association, "Selectable Mode Vocoder Service Option for Wideband Spread Spectrum Communication Systems," TIA PN4575, 3GPP2 C.P9001, 1997.

9. Authors' Addresses

   Peter J. McCann
   Lucent Technologies
   Rm 2Z-305
   263 Shuman Blvd
   Naperville, IL 60566-7050
   USA
   Phone: +1 630 713 9359
   FAX: +1 630 713 4982
   EMail: mccap@lucent.com

   Tom Hiller
   Lucent Technologies
   Rm 2F-218
   263 Shuman Blvd
   Naperville, IL 60566-7050
   USA
   Phone: +1 630 979 7673
   FAX: +1 630 979 7673
   EMail: tom.hiller@lucent.com

Acknowledgements

Thanks to Paul Francis for some of the terminology and concepts introduced in Section 4.2.

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF Secretariat.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.