Internet Engineering Task Force                               H. Schulzrinne
INTERNET-DRAFT                                         AT&T Bell Laboratories
                                                             October 27, 1992
                                                              Expires: 4/1/93

      A Transport Protocol for Audio and Video Conferences and other
                  Multiparticipant Real-Time Applications

Status of this Memo

This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress." Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft.

Distribution of this document is unlimited.

Abstract

This draft discusses aspects of transporting real-time services such as voice and video over the Internet. It compares and evaluates design alternatives for a proposed real-time transport protocol. Appendices touch on issues of port assignment and multicast address allocation.

Acknowledgments

This draft is based on discussions within the AVT working group chaired by Stephen Casner. Eve Schooler and Stephen Casner provided valuable comments. This work was supported in part by the Office of Naval Research under contract N00014-90-J-1293, the Defense Advanced Research Projects Agency under contract NAG2-578 and a National Science Foundation equipment grant, CERDCR 8500332.

Contents

1  Introduction                                                          3

2  Goals                                                                 5

3  Services                                                              8
   3.1   Framing                                                         9
   3.2   Version Identification                                          9
   3.3   Conference Identification                                      10
         3.3.1  Demultiplexing                                          10
         3.3.2  Aggregation                                             11
   3.4   Media Encoding Identification                                  11
         3.4.1  Audio Encodings                                         12
         3.4.2  Video Encodings                                         14
   3.5   Playout Synchronization                                        14
         3.5.1  Synchronization Method                                  18
         3.5.2  End-of-talkspurt indication                             21
         3.5.3  Recommendation                                          21
   3.6   Segmentation and Reassembly                                    21
   3.7   Source Identification                                          22
         3.7.1  Gateways, Reflectors and End Systems                    22
         3.7.2  Address Format Issues                                   24
   3.8   Energy Indication                                              25
   3.9   Error Control                                                  25
   3.10  Security                                                       26
         3.10.1 Encryption                                              26
         3.10.2 Authentication                                          27
   3.11  Quality of Service Control                                     27

4  Conference Control Protocol                                          28

5  Packet Format                                                        28
   5.1   Data                                                           28
   5.2   Control Packets                                                30

A  Port Assignment                                                      31

B  Multicast Address Allocation                                         34

C  Glossary                                                             36

D  Address of Author                                                    40

1 Introduction

The real-time transport protocol (RTP) discussed in this draft aims to provide services commonly required by interactive multimedia conferences, in particular playout synchronization, demultiplexing, media identification and active-party identification.
However, RTP is not restricted to multimedia conferences; it is anticipated that other real-time services such as remote data acquisition and control may also find its services of use.

In this context, a conference describes associations that are characterized by the participation of two or more agents, interacting in real time with one or more media of potentially different types. The agents are anticipated to be human, but may also be measurement devices, remote media servers, simulators and the like. Both two-party and multiple-party associations are to be supported, where one or more agents can take active roles, i.e., generate data. Thus, applications not commonly considered conferences fall under our wider definition, for example one-way media such as the network equivalent of closed-circuit television or radio, traditional two-party telephone conversations, or real-time distributed simulations.

Even though RTP is intended for real-time interactive applications, its use for the storage and transmission of recorded real-time data should be possible, with the understanding that the interpretation of some fields, such as timestamps, may be affected by this off-line mode of operation.

RTP uses the services of an end-to-end transport protocol such as UDP, TCP, OSI TPx, ST-II [1, 2] or the like.(1) The services used are end-to-end delivery, framing, demultiplexing and multicast. The underlying network is not assumed to be reliable and can be expected to lose, corrupt, arbitrarily delay and reorder packets. However, the use of RTP within quality-of-service (e.g., rate) controlled networks is anticipated to be of particular interest. Network layer support for multicasting is desirable, but not required.

RTP is supported by a real-time control protocol (RTCP) in a relationship similar to that between IP and ICMP. However, RTP can function, with reduced functionality, without a control protocol.
The control protocol provides minimum functionality for maintaining conference state for a single medium. It is not guaranteed to be reliable and is assumed to be multicast to all participants of a conference. Conferences encompassing several media are managed by a (reliable) conference control protocol, whose definition is outside the scope of this note. Some aspects of its functionality, however, are described in Section 4.

Within this working group, some common encoding rules and algorithms for media should be specified, keeping in mind that this aspect is largely independent of the remainder of the protocol. Without this specification, interoperability cannot be achieved. It is suggested, however, to keep the two aspects in separate RFCs, as changes in media encoding should be independent of the transport aspects. The encoding specification should include such things as the byte order for multi-byte samples, the sample order for multi-channel audio, the format of state information for differential encodings, the segmentation of encoded video frames into packets, and the like.

As part of this working group (or the conference architecture BOF/working group), some number assignment issues will have to be addressed, in particular for encoding formats and for port and address usage. The issue of port assignment is discussed in more detail in Appendix A. It should be emphasized, however, that UDP port assignment does not imply that all underlying transport mechanisms share this or a similar port mechanism.

This draft aims to summarize some of the discussions held within the AVT working group chaired by Stephen Casner, but the opinions are the author's own. Where possible, references to previous work are included, but the author realizes that the attribution of ideas is far from complete.

------------------------------
1.
ST-II is not properly a transport protocol, as it is visible to intermediate nodes, but it provides services such as process demultiplexing commonly associated with transport protocols.

The draft builds on operational experience with Van Jacobson's and Steve McCanne's vat audio conferencing tool as well as implementation experience with the author's Nevot network voice terminal. This note will frequently refer to NVP [3], the network voice protocol, the only such protocol currently specified in an RFC within the Internet. The CCITT has standardized, as recommendations G.764 and G.765, a packet voice protocol stack for use in digital circuit multiplication equipment.

The name RTP was chosen to reflect the fact that audio-visual conferences may not be the only applications employing its services, while the real-time nature of the protocol is important, setting it apart from other multimedia transport mechanisms such as the MIME multimedia mail effort [4].

The remainder of this draft is organized as follows. Section 2 summarizes the design goals of this real-time transport protocol. Then, Section 3 describes the services to be provided in more detail. Section 4 briefly outlines some of the services added by the conference control protocol; a more detailed description is outside the scope of this document. Given the required services and design goals, Section 5 outlines possible packet formats for RTP and RTCP. Two appendices discuss the issues of port assignment and multicast address allocation, respectively. A glossary defines terms and acronyms, providing references for further detail.
2 Goals

Design decisions should be measured against the following goals, not necessarily listed in order of importance:

media flexibility: While the primary applications that motivate the protocol design are conference voice and video, it should be anticipated that other applications may also find the services provided by the protocol useful. Some examples include distribution audio/video (for example, the ``Radio Free Ethernet'' application by Sun) and some forms of (loss-tolerant) remote data acquisition. Note that it may be possible that different media interpret the same packet header field in different ways (e.g., a synchronization bit may be used to indicate the beginning of a talkspurt for audio and the beginning of a frame for video). Also, new formats of established media, for example high-quality multi-channel audio, should be anticipated where possible.

extensible: Researchers and implementors within the Internet community are currently only beginning to explore real-time multimedia services such as audio-visual conferences. Thus, RTP should be able to incorporate additional services as operational experience with the protocol accumulates and as applications not originally anticipated find its services useful. The same mechanisms should also allow experimental applications to exchange application-specific information without jeopardizing interoperability with other applications. Extensibility is also desirable as it will hopefully speed along the standardization effort, making the consequences of leaving out some group's favorite fixed header field less drastic. It should be understood that extensibility and flexibility may conflict with the goals of bandwidth and processing efficiency.

independent of lower-layer protocols: RTP should make as few assumptions about the underlying transport protocol as possible.
It should, for example, work reasonably well with UDP, TCP, ST-II, OSI TP, VMTP and experimental protocols, for example protocols that support resource reservation and quality-of-service guarantees. Naturally, not all transport protocols are equally suited for real-time services; in particular, TCP may introduce unacceptable delays over anything but low-error-rate LANs. Also, protocols that deliver streams rather than packets need additional framing services, as discussed in Section 3.1. It remains to be discussed whether RTP may use services provided by the lower-layer protocols for its own purposes (time stamps and sequence numbers, for example).

The goal of independence from lower-layer considerations also affects the issue of address representation. In particular, anything too closely tied to the current IP 4-byte addresses may face early obsolescence. However, the charter of the working group is short term, so that longer-term changes in host addressing can legitimately be ignored.

gateway-compatible: Operational experience has shown that RTP-level gateways are necessary and desirable for a number of reasons. First, it may be desirable to aggregate several media streams into a single stream and then retransmit it with a possibly different encoding, packet size or transport protocol. A reflector that achieves multicasting by user-level copying may be needed where multicast tunnels are unavailable or the end systems are not multicast-capable.

bandwidth efficient: It is anticipated that the protocol will be used in networks with a wide range of bandwidths and with a variety of media encodings. Despite increasing bandwidths within the national backbone networks, bandwidth efficiency will continue to be important for transporting conferences across 56 kb/s links, office-to-home high-speed modem connections and international links.
To minimize end-to-end delay and the effect of lost packets, packetization intervals have to be limited, which, in combination with efficient media encodings, leads to short packet sizes. Generally, packets containing 16 to 32 ms of speech are considered optimal [5, 6, 7]. For example, even with a 65 ms packetization interval, a 4800 b/s encoding produces 39-byte packets. Current Internet voice experiments use packets containing between 20 and 22.5 ms of audio, which translates into 160 to 180 bytes of audio information coded at 64 kb/s. Video packets are typically much longer, so that header overhead is less of a concern.

For UDP multicast (without counting the overhead of source routing as currently used in tunnels, or of a separate IP encapsulation as planned), IPv4 incurs 20 bytes and UDP an additional 8 bytes of header overhead, not counting any datalink layer headers of at least 4 bytes. With RTP header lengths between 4 and 8 bytes, the total overhead amounts to between 36 and 40 (or more) bytes per audio or video packet. For 160-byte audio packets, the overhead of 8-byte RTP headers together with UDP, IP and PPP headers is 25%. For low-bitrate coding, packet headers can easily double the necessary bit rate. Thus, it appears that any fixed headers beyond eight bytes would have to make a significant contribution to the protocol's capabilities to outweigh their standing in the way of running RTP applications over low-speed links. The current fixed header lengths for NVP and vat are 4 and 8 bytes, respectively. It is interesting to note that G.764 has a total header overhead, including the LAPD data link layer, of only 8 bytes, as the voice transport is considered a network-layer protocol. The overhead is split evenly between layers 2 and 3.
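The overhead figures above can be checked with a short sketch. The header sizes are the draft's assumptions (IPv4 20 bytes, UDP 8 bytes, RTP 4 to 8 bytes, at least 4 bytes of datalink header); the function name is illustrative only.

```python
# Back-of-the-envelope header overhead, mirroring the figures in the text.

def overhead_percent(payload_bytes, rtp_hdr, ip_hdr=20, udp_hdr=8, dl_hdr=4):
    """Header bytes as a percentage of the media payload."""
    hdrs = ip_hdr + udp_hdr + rtp_hdr + dl_hdr
    return 100.0 * hdrs / payload_bytes

# 20 ms of 64 kb/s PCM audio = 160 bytes of payload.
pcm_payload = 64000 // 8 * 20 // 1000            # 160 bytes
print(overhead_percent(pcm_payload, rtp_hdr=8))  # 25.0

# 4800 b/s coder, 65 ms packetization interval: 39-byte packets.
lpc_payload = 4800 * 65 // 1000 // 8             # 39 bytes
print(overhead_percent(lpc_payload, rtp_hdr=8))  # ~102.6: headers double the rate
```

This makes concrete why low-bitrate coders suffer most: the header cost is fixed per packet, so it dominates as the payload shrinks.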
Bandwidth efficiency can be achieved by transporting non-essential or slowly changing protocol state in optional fields or in a separate low-bandwidth control protocol. Also, header compression [8] may be used.

international: Even now, audio and visual conferencing tools are used far beyond the North American continent. It would seem appropriate to give consideration to internationalization concerns, for example to allow for the European A-law audio encoding and non-US-ASCII character sets in textual data such as site identification.

processing efficient: At packet arrival rates on the order of 40 to 50 per second for a single voice or video source, per-packet processing overhead may become a concern, particularly if the protocol is to be implemented on other than high-end platforms. Multiplication and division operations should be avoided where possible, and fields should be aligned to their natural size, i.e., an n-byte integer is aligned on an n-byte multiple, where possible.

implementable: Given the anticipated lifetime and experimental nature of the protocol, it must be implementable with current hardware and operating systems. That does not preclude that hardware and operating systems geared towards real-time services may improve the performance or capabilities of the protocol, e.g., allow better intermedia synchronization.

3 Services

The services that may be provided by RTP are summarized below. Note that not all services have to be offered. Services anticipated to be optional are marked with an asterisk.

o framing (*)

o demultiplexing by conference/association (*)

o demultiplexing by media source

o demultiplexing by media encoding

o synchronization between source(s) and destination(s)

o error detection (*)

o encryption (*)

o quality-of-service monitoring (*)

In the following sections, we will discuss how these services are reflected in the proposed packet header.
Information to be conveyed within the conference can be roughly divided into information that changes with every data packet and other information that stays constant for longer time periods. State information that does not change with every packet can be carried in several different ways:

as a fixed part of the RTP header: This method is easiest to decode and ensures state synchronization between sender and receiver(s), but can be bandwidth-inefficient or restrict the amount of state information that can be conveyed.

as a header option: The information is only carried when needed. It requires more processing by the sending and receiving application. If contained in every packet, it is also less bandwidth-efficient than the first method.

within RTCP packets: This approach is roughly equivalent to header options in terms of processing and bandwidth efficiency. Some means of identifying when a particular option takes effect within the data stream may have to be provided.

within conference control: The state information is conveyed when the conference is established or when the information changes. As for RTCP packets, a synchronization mechanism between data and control may be required for certain information.

through a conference directory: This is a variant of the conference control mechanism, with a (distributed) directory at a well-known location maintaining state information about on-going or scheduled conferences. Changing state information during a conference is probably more difficult than with conference control, as participants need to be told to look at the directory for changed information. Thus, a directory is probably best suited to hold information that will persist through the life of the conference, for example its multicast group, title and organizer.

The first two methods are examples of in-band signaling, the others of out-of-band signaling.
3.1 Framing

To satisfy the goal of transport independence, we cannot assume that the lower layer provides framing. (Consider TCP as an example; even though it would probably not be used for real-time applications except possibly on a local network, it may be used in distributing recorded audio or video segments.) Thus, if and only if the underlying protocol does not provide framing, the RTP packet is prefixed by a 16-bit byte count. The byte count could also be used by mutual agreement if it is deemed desirable to carry several RTP packets in a single TPDU for increased efficiency.

3.2 Version Identification

Humility suggests that we anticipate that we may not get the first iteration of the protocol right. In order to avoid ``flag days'' where everybody shifts to a new protocol, a version identifier could ensure continued interoperability. This is particularly important since UDP, for example, does not carry a ``next protocol'' identifier. The difficulty in interworking between the current vat and NVP protocols further affirms the necessity of a version identifier. However, the version identifier can be anticipated to be the most static of all proposed header fields.

Since the length of the header and the location and meaning of the option length field may be affected by a version change, encoding the version within an optional field is not feasible. Putting the version number into the control protocol packets would make RTCP mandatory and would make rapid scanning of conferences significantly more difficult. vat currently offers a 2-bit version field, while this capability is missing from NVP. Given the low bit usage of a version identifier and its utility in other contexts (IP, ST-II), it may be prudent to include one.
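The framing rule of Section 3.1 can be sketched as follows: each RTP packet carried over a stream transport is prefixed by a 16-bit byte count in network byte order. This is a minimal illustration, not part of any specification; the function names are invented for the example.

```python
import struct

def frame(packet: bytes) -> bytes:
    """Prefix a packet with a 16-bit (network byte order) byte count."""
    if len(packet) > 0xFFFF:
        raise ValueError("packet too long for a 16-bit byte count")
    return struct.pack("!H", len(packet)) + packet

def deframe(stream: bytes):
    """Split a stream of length-prefixed packets back into packets."""
    packets, offset = [], 0
    while offset + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, offset)
        packets.append(stream[offset + 2:offset + 2 + length])
        offset += 2 + length
    return packets

# Several framed packets can share one TPDU, as the text suggests:
buf = frame(b"pkt-one") + frame(b"pkt-two")
print(deframe(buf))  # [b'pkt-one', b'pkt-two']
```

The same byte count also serves the mutual-agreement case of packing several RTP packets into one TPDU, since the receiver simply keeps consuming length-prefixed records until the buffer is exhausted.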
3.3 Conference Identification

A conference identifier (conference ID) could serve two mutually exclusive functions: providing another level of demultiplexing, or a means of logically aggregating flows with different network addresses and port numbers. vat specifies a 16-bit conference identifier.

3.3.1 Demultiplexing

Demultiplexing by RTP allows one association, characterized by destination address and port number, to carry several distinct conferences. However, this appears to be necessary only if the number of conferences exceeds the demultiplexing capability available through (multicast) addresses and port numbers.

Efficiency arguments suggest that combining several conferences or media within a single multicast group is not desirable. Combining several conferences or media within a single multicast address negates the bandwidth efficiency afforded by multicasting. Also, applications that are not interested in a particular conference or not capable of dealing with a particular medium are still forced to handle the packets delivered for that conference or medium. Consider as an example two separate applications, one for audio, one for video. If both share the same multicast address and port, being differentiated only by the conference identifier, the operating system has to copy each incoming audio and video packet into two application buffers and perform a context switch to both applications, only to have one immediately discard the incoming packet. Given that application-layer demultiplexing has strong negative efficiency implications and given that multicast addresses are not an extremely scarce commodity, there seems to be no reason to burden every application with maintaining and checking conference identifiers for the purpose of demultiplexing.
It is also not recommended to use this field to distinguish between different encodings, as it would be difficult for the application to decide whether a new conference identifier means that a new conference has arrived or simply that all participants should be moved to a new conference with a different encoding. Since the encoding may change for some but not all participants, we could find ourselves breaking a single logical conference into several pieces, with a fairly elaborate control mechanism needed to decide which conferences logically belong together.

3.3.2 Aggregation

Particularly within a network with a wide range of capacities, using a different multicast group for each media component of a conference makes it possible to tailor the media distribution to the network bandwidths and end-system capabilities. It appears useful, however, to have a means of identifying groups that logically belong together, for example for purposes of time synchronization. A conference identifier used in this manner would have to be globally unique. It appears that such logical connections would better be identified as part of the control protocol, by identifying all multicast addresses belonging to the same logical conference, thereby avoiding the assignment of globally unique identifiers.

3.4 Media Encoding Identification

This field plays a similar role to the protocol field in data link or network protocols, indicating the next higher layer (here, the media decoder) that the data is meant for. For RTP, this field would indicate the audio, video or other media encoding. In general, the number of distinct encodings should be kept as small as possible to increase the chance that applications can interoperate. A new encoding should only be recognized if it significantly enhances the range of media quality or the types of networks conferences can be conducted over.
The unnecessary proliferation of encodings can be reduced by making reference implementations of standard encoders and decoders widely available. It should be noted that encodings may not be enumerable as easily as, say, transport protocols. A particular family of related encoding methods may be described by a set of parameters, as discussed below in the sections on audio and video encoding.

Encodings may change during the duration of a conference. This may be due to changed network conditions, changed user preference or because the conference is joined by a new participant that cannot decode the current encoding. If the information necessary for the decoder is conveyed out-of-band, some means of indicating when the change takes effect needs to be incorporated. Also, the indication that the encoding is about to change must reach all receivers reliably before the first packet employing the new encoding. Each receiver needs to track pending changes of encodings and check for every incoming packet whether an encoding change is to take effect with this packet.

Conveying media encodings rapidly is also important to allow scanning of conferences or broadcast media. A directory service could provide encoding information for on-going conferences. This may not be sufficient, however, unless all participants within a conference use the same encoding. Also, the usual synchronization problems between transmitted data and directory information apply.

There are at least two approaches to indicating media encoding, either in-band or out-of-band:

conference-specific: Here, the media identifier is an index into a table designating the approved or anticipated encodings (together with any particular version numbers or other parameters) for a particular conference or user community. The table can be distributed through RTCP, a conference control protocol or some other out-of-band means.
Since the number of encodings used during a single conference is likely to be small, the field width in the header can likewise be small. Also, there is no need to agree on an Internet-wide list of encodings. It should be noted that conveying the table of encodings through RTCP forces the application to maintain a separate mapping table for each sender, as there can be no guarantee that all senders will use the same table.

global: Here, the media identifier is an index into a global table of encodings. A global list reduces the need for out-of-band information. Transmitting the parameters associated with an encoding may be difficult, however, if it has to be done within the header space constraints of per-packet signaling.

To make detecting coder mismatches easier, encodings for all media should be drawn from the same numbering space. To facilitate experimentation with new encodings, a part of any global encoding numbering space should be set aside for experimental encodings, with numbers agreed upon within the community experimenting with the encoding and no Internet-wide guarantee of uniqueness.

3.4.1 Audio Encodings

Audio data is commonly characterized by three independent descriptors: the encoding (the translation of one or more audio samples into a channel symbol), the number of channels (mono, stereo) and the sampling rate. Theoretically, sampling rate and encoding are (largely) independent. We could, for example, apply µ-law encoding to any sampling rate, even though it is traditionally used with a rate of 8,000 Hz. In practical terms, it may be desirable to limit the combinations of encoding and sampling rate to the values the encoding was designed for.(2)

Channel counts between 1 and 4 should be sufficient and can be encoded into 2 bits by encoding the channel count minus one.

The audio encodings listed in Table 1 appear particularly interesting, even though the list is by no means exhaustive and does not include some experimental protocols currently in use, for example a non-standard form of LPC. The bit rate is shown per channel. ks/s, b/sample and kb/s denote kilosamples per second, bits per sample and kilobits per second, respectively. If sampling rates are to be specified separately, the values of 8, 16, 32, 44.1, and 48 kHz suggest themselves, even though other values (11.025 and 22.05 kHz) are supported on some workstations (the Silicon Graphics audio hardware and the Apple Macintosh, for example). Clearly, little is to be gained by allowing arbitrary sampling rates, as conversion, particularly between rates not related by simple fractions, is quite cumbersome and processing-intensive.

   Org.      Name        ks/s   b/sample  kb/s    description
   ------------------------------------------------------------------
   CCITT     G.711       8.0    8         64      µ-law PCM
   CCITT     G.711       8.0    8         64      A-law PCM
   CCITT     G.721       8.0    4         32      ADPCM
   Intel     DVI         8.0    4         32      ADPCM
   CCITT     G.723       8.0    3         24      ADPCM
   CCITT     G.726                                ADPCM
   CCITT     G.727                                ADPCM
   NIST/GSA  FS 1015     8.0              2.4     LPC-10E
   NIST/GSA  FS 1016     8.0              4.8     CELP
   NADC      IS-54       8.0              7.95    VSELP
   CCITT     G.7xy       8.0              16      LD-CELP
   GSM                   8.0              13      RPE-LPC
   CCITT     G.722       16.0             64      7 kHz, SB-ADPCM
             MPEG audio  32.0   16        256
             DAT         32.0   16        512
             DAT         44.1   16        705.6   CD, DAT playback
             DAT         48.0   16        768     DAT record

            Table 1: Standardized and common audio encodings

------------------------------
2. Given the wide availability of µ-law encoding and its low overhead, using it with a sampling rate of 16,000 or 32,000 Hz might be quite appropriate for high-quality audio conferences, even though there are other encodings, such as G.722, specifically designed for such applications. Note that the signal-to-noise ratio of µ-law encoding is about 38 dB, equivalent to an AM receiver. The ``telephone quality'' associated with G.711 is due primarily to the limitation in frequency response to the 200 to 3500 Hz range.
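The suggested 2-bit channel field (channel count minus one) can be illustrated with a minimal sketch; the function names are invented for the example, not proposed field names.

```python
# Encoding channel counts 1..4 into a 2-bit header field, as suggested
# in the text: the field carries the channel count minus one.

def encode_channels(channels: int) -> int:
    """Map a channel count of 1..4 onto a 2-bit field value 0..3."""
    if not 1 <= channels <= 4:
        raise ValueError("a 2-bit field only covers 1 to 4 channels")
    return channels - 1

def decode_channels(field: int) -> int:
    """Recover the channel count from the 2-bit field."""
    return (field & 0x3) + 1

# Round-trip mono through quadraphonic:
print([decode_channels(encode_channels(c)) for c in (1, 2, 3, 4)])  # [1, 2, 3, 4]
```

The bias by one is what makes all four useful values fit in two bits; a field carrying the raw count would waste the zero code point and exclude four channels.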
3.4.2 Video Encodings

Common video encodings are listed in Table 2. Encodings with a tunable rate can be configured for different rates, but produce a fixed-rate stream. The average bit rate produced by variable-rate codecs depends on the source material.

   Org.        name    rate               remarks
   ---------------------------------------------------------
   CCITT       JPEG    tunable
   CCITT       MPEG    variable, tunable
   CCITT       H.261   tunable
               Bolter  variable, tunable
   PictureTel  ??
   BBN         DVC     variable, tunable  block differences

                Table 2: Common video encodings

3.5 Playout Synchronization

A major purpose of RTP is synchronization between the source and sink(s) of a single medium. Note that this is to be distinguished from synchronization between different media such as audio and video (lip sync). Sometimes the two forms are referred to as intra-media and inter-media synchronization. RTP concerns itself only with intra-media or playout synchronization, although its mechanisms, such as timestamps, may be necessary for inter-media synchronization.

In connection with playout synchronization, we can group packets into playout units, a number of which in turn form a synchronization unit. More specifically, we define:

synchronization unit: A synchronization unit consists of one or more playout units (see below) that, as a group, share a common fixed delay between generation and playout of each part of the group. The delay may change at the beginning of such a synchronization unit. The most common synchronization units are talkspurts for voice and frames for video transmission.

playout unit: A playout unit is a group of packets sharing a common timestamp. (Naturally, packets whose timestamps are identical due to timestamp wrap-around are not considered part of the same playout unit.)
For voice, the playout unit would typically be a single voice segment, while for video a video frame could be broken down into subframes, each consisting of packets sharing the same timestamp and ordered by some form of sequence number.

All proposed synchronization methods require a timestamp. The timestamp has to have a sufficient range that wrap-arounds are infrequent. It is desirable that the range exceeds the maximum expected inactive (e.g., silence) period. Otherwise, special handling may be necessary in the case of the sequence number/timestamp combination, as the beginning of the next active period could have a timestamp one greater than the last one, thus masking the beginning of the talkspurt. The 10-bit timestamp used by NVP is generally agreed to be too small, as it wraps around after only 20.5 s (for 20 ms audio packets), while a 32-bit timestamp should serve all anticipated needs, even if the timestamp is expressed in units of samples or other sub-packet entities. Three proposals as to the interpretation of the timestamp have been advanced:

packet/frame: Each packetization or (video/audio) frame interval increments the timestamp. This approach is very efficient in terms of processing and bit use, but cannot be used without out-of-band information if the time interval of media ``covered'' by a packet varies from packet to packet. This occurs for example with variable-rate encoders or if the packetization interval is changed during a conference. This interpretation of a timestamp is assumed by NVP, which defines a frame as a block of PCM samples or a single LPC frame. Note that there is no inherent necessity that all participants within a conference use the same packetization interval. Local implementation considerations such as available clocks may suggest other intervals. As another example, consider a conference with feedback.
For the lecture audio, a long packetization interval may be desirable to better amortize packet headers. For side chats, delays are more important, thus suggesting a shorter packetization interval.(3)

sample: This method simply counts samples, allowing a direct translation between timestamp and playout buffer insertion point. It is just as easily computable as the per-packet timestamp. However, for some media and encodings(4), it may not be quite clear what a sample is. Also, some care must be taken at the receiver if incoming streams use different sampling rates. This method is currently used by vat.

subset of NTP timestamp: 16 bits encode seconds relative to 0 o'clock, January 1, 1900 (modulo 65536) and 16 bits encode fractions of a second, with a resolution of approximately 15.2 µs, which is smaller than any anticipated audio sampling or video frame interval. This timestamp is the same as the middle 32 bits of the 64-bit NTP timestamp [9]. It wraps around every 18.2 hours. If it should be desirable to reconstruct absolute transmission time at the receiver for logging or recording purposes, it should be easy to determine the most significant 16 bits of the timestamp.

------------------------------
3. Nevot, for example, allows each participant to have a different packetization interval, independent of the packetization interval used by Nevot for its outgoing audio. Only the packetization interval for outgoing audio for all conferences must be the same.
4. Examples include frame-based encodings such as LPC and CELP. Here, given that these encodings are based on 8,000 Hz input samples, the preferred interpretation would probably be in terms of audio samples, not frames, as samples would be used for reconstruction and mixing.
Otherwise, wrap-arounds are not a significant problem as long as they occur 'naturally', i.e., at a 16 or 32 bit boundary, so that explicit checking on arithmetic operations is not required. Also, since the translation mechanism would probably treat the timestamp as a single integer without accounting for its division into whole and fractional part, the exact bit allocation between seconds and fractions thereof is less important. However, the 16/16 approach simplifies extraction from a full NTP timestamp. The NTP-like timestamp has the disadvantage that its resolution does not map into any of the common sample intervals. Thus, there is a potential uncertainty of one sample at the receiver as to where to place the beginning of the received packet, resulting in the equivalent of a one-sample slip. CCITT recommendation G.821 postulates a mean slip rate of less than 1 slip in 5 hours, with degraded but acceptable service for less than 1 slip in 2 minutes. Tests with appropriate rounding conducted by the author showed that this most likely does not cause problems. In any event, a double-precision floating point multiplication is needed to translate between this timestamp and the integer sample count available on transmission and required for playout.(5) It has been suggested to use timestamps relative to the beginning of the first transmission from a user. This makes correlation between media from different participants difficult and seems to have no technical or implementation advantages, except for avoiding wrap-around during most conferences. As pointed out above, that seems to be of little benefit. Clearly, the reliability of wallclock-synchronized timestamps depends on how closely the system clocks are synchronized, but that does not argue for giving up potential real-time synchronization in all cases. It also needs to be decided whether the timestamp should reflect real time or sample time.
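As an illustration of the 16/16 timestamp format, the following sketch (hypothetical helper names, not defined by this draft) packs a time value into the middle 32 bits of an NTP timestamp, showing the 1/65536 s resolution and the 65536 s (18.2 hour) wrap-around discussed above.

```python
def to_ts1616(seconds):
    """Pack a time in seconds into a 32-bit 16/16 fixed-point timestamp:
    16 bits of whole seconds (modulo 65536) and 16 bits of binary
    fractions of a second (resolution 1/65536 s, about 15.2 us)."""
    return int(round(seconds * 65536)) & 0xFFFFFFFF

def from_ts1616(ts):
    """Unpack a 16/16 timestamp back to seconds (modulo 65536 s)."""
    return ts / 65536.0
```

Because the field is treated as one 32-bit integer, timestamp differences and additions work with ordinary modulo-2^32 arithmetic, which is the property the text relies on for 'natural' wrap-around handling.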
A real time timestamp is defined to track wallclock time plus or minus a constant offset. Sample time increases by the nominal sampling interval for each sample. The two clocks in general do not agree, since the clock source used for sampling will in all likelihood be slightly off the nominal rate. For example, typical crystals without temperature control are only accurate to 50 -- 100 ppm (parts per million), yielding a potential drift of 0.36 seconds per hour between the sampling clock and wallclock time. Using real time rather than sample time allows for easier synchronization between different media and compensation for slow or fast sample clocks. Note that it is neither desirable nor necessary to obtain the wallclock time when each packet was sampled. Rather, the sender determines the wallclock time at the beginning of each synchronization unit (e.g., a talkspurt for voice and a frame for video) and adds the nominal sample clock duration for all packets within the talkspurt to arrive at the timestamp value carried in packets. The real time at the beginning of a talkspurt is determined by estimating the true sample rate for the duration of the conference. The sample rate estimate has to be accurate enough to allow placing the beginning of a talkspurt to within at most, say, 50 to 100 ms; otherwise, the lack of synchronization may be noticeable, delay computations are confused and successive talkspurts may be concatenated. Estimating the true sampling instant to within a few milliseconds is surprisingly difficult for current operating systems.

------------------------------
5. The multiplication with an appropriate factor can be approximated to the desired precision by an integer multiplication and division, but multiplication by a floating point value is generally much faster on modern processors.
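A quick check of the drift figure quoted above (a hypothetical helper, not part of the protocol): a clock off by 100 ppm accumulates 3600 s/h x 100e-6 = 0.36 s of drift per hour.

```python
def drift_seconds(ppm_error, hours):
    """Accumulated drift between a sample clock that is off by ppm_error
    parts per million and wallclock time, over the given number of hours."""
    return hours * 3600.0 * ppm_error * 1e-6
```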
The sample rate r can be estimated as

    r = (s + q) / (t - t0)

Here, t is the current time, t0 the time at which the first sample was acquired, s the number of samples read, and q the number of samples ready to be read (queued) at time t. Then, the timestamp to be inserted into the synchronization packet is computed as t0 + s/r. Unfortunately, only s is known precisely. The accuracy of the estimates of t0 and t depends on how accurately the beginning of sampling and the last reading from the audio device can be measured. There is a non-zero probability that the process will get preempted between the time the audio data is read and the instant the system clock is sampled. It remains unclear whether indications of current buffer occupancy, if available, can be trusted. Experiments with the SunOS audio driver showed significant variations of the estimated sample rate, with discontinuities of the computed timestamps of up to 25 ms. Kernel support is probably required for meaningful real time measurements.

Sample time increments with the sampling interval for every sample or (sub)frame received from the audio or video hardware. It is easy to determine, as long as care is taken to avoid cumulative round-off errors incurred by simply repeatedly adding the approximate packetization interval. However, synchronization between media and end-to-end delay measurements are then no longer feasible. (Example: Consider an audio and a video stream. If the audio sample clock is slightly faster than the real clock and the video sampling clock, a video and an audio frame belonging together would be marked with different timestamps and thus played out at different instants.)
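The rate estimate r = (s + q)/(t - t0) and the derived talkspurt timestamp can be sketched as follows; variable names follow the text, and the helper functions are illustrative.

```python
def estimate_rate(s, q, t, t0):
    """Estimate the true sample rate from s samples read, q samples still
    queued, the current time t and the time t0 at which sampling began."""
    return (s + q) / (t - t0)

def stream_timestamp(t0, s, r):
    """Real-time timestamp for the current stream position: start of
    sampling plus the nominal duration of the s samples read so far."""
    return t0 + s / r
```

For an 8 kHz source, ten seconds of sampling yields roughly 80,000 samples; small errors in t or t0 (e.g., due to preemption between the device read and the clock read) translate directly into rate-estimate jitter, which is the effect observed with the SunOS driver above.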
If we are forced to use sample time, the advantage of using an NTP timestamp disappears, as the receiver can easily reconstruct an NTP sample-based timestamp from the sample count if needed, but would not have to if no cross-media synchronization is required. RTCP could convey the time increment per sample in full precision. It should be noted that it may not be possible to associate a meaningful notion of time with every packet. For example, if a video frame is broken into several fragments, there is no natural timestamp associated with anything but the first fragment, particularly if there is not even a sequential mapping from screen scan location into packets. Thus, any timestamp used would be purely artificial. A synchronization bit could be used in this particular case to mark the beginning of synchronization units. For packets within synchronization units, there are two possible approaches: first, we can introduce an auxiliary sequence number that is only used to order packets within a frame. Secondly, we could abuse the timestamp field by incrementing it by a single unit for each packet within the frame, thus allowing a variable number of packets per frame. The latter approach is barely workable and rather kludgy.

3.5.1 Synchronization Method

timestamp/sequence number: This method is currently used by NVP. The sequence number is incremented with every transmitted packet. For audio, the beginning of a talkspurt is indicated when successive packets differ in timestamp more than they differ in sequence number. As long as packets are not reordered, determination of the beginning of a talkspurt is generally easy, except for the unlikely case where a new talkspurt has a timestamp that, due to timestamp wrap-around, is one greater than the last packet of the previous talkspurt. However, if packets are reordered, delay adaptation at the beginning of a talkspurt becomes unreliable.
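The beginning-of-talkspurt test for the timestamp/sequence number method can be sketched as follows (an illustrative helper; timestamps are assumed to advance by ts_per_packet units per packetization interval):

```python
def starts_talkspurt(prev_seq, prev_ts, seq, ts, ts_per_packet):
    """A new talkspurt begins when successive packets differ more in
    timestamp than in sequence number (scaled to timestamp units),
    i.e., when time passed during which no packets were sent."""
    return (ts - prev_ts) > (seq - prev_seq) * ts_per_packet
```

During silence no packets are transmitted, so the sequence number stays contiguous while the timestamp jumps; the inequality detects exactly that gap.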
Consider the scenario laid out in Table 3. For convenience, the example assumes that clocks at the transmitter and receiver are perfectly synchronized; also, timestamps are expressed in wallclock time, increasing by 20 time units for each packet. The current playout delay, that is, the jitter estimate, is set at 50 time units and is assumed to stay constant throughout the example. In the table, packet 210 is recognized as the beginning of a new talkspurt if there has been no reordering. If packets 210 and 211 arrive in reverse transmission order, the receiver can only conclude that packet 211 introduces a new talkspurt. Because the wrong packet is treated as the beginning of a talkspurt, the playout delay is really one packetization interval too short for the remainder of the talkspurt. In the example, packet 212 arrives too late and misses its playout time, even though it would have made playout without reordering. This scenario assumes that packets are mixed in at the time of arrival so that their playout time cannot be changed. It is possible to relax that assumption and reschedule packets after discovering that the wrong packet was used as the talkspurt beginning; this, however, would seem to complicate the implementation greatly, as determining how long the mixing is to be delayed cannot be readily decided. Unfortunately, reordering at the beginning of a talkspurt is particularly likely, since common silence detection algorithms send a group of packets to prevent front clipping.

                        no reordering       with reordering
 seq.   timestamp      arrival  playout     arrival  playout
 200      1020          1520     1570        1520     1570
 201      1040          1530     1590        1530     1590
 210      1220          1720     1790        1725     1770
 211      1240          1725     1810        1720     1790
 212      1260          1825     1830        1825     1810

 Table 3: Example where out-of-order arrival leads to packet loss

timestamp/synchronization bit: This method is currently used by vat.
Here, the beginning of a talkspurt is indicated by setting the synchronization bit. A sequence number is not required. This synchronization method is unaffected by out-of-order packet delivery. If the first packet of a talkspurt is lost, two talkspurts are simply merged, without dire consequences except for a missed chance to have the playout delay reflect the delay jitter estimate. The synchronization bit has to be ignored if a packet with a larger timestamp has already arrived. The insertion rule can thus be expressed as

         /  p + dmax           for n = 1
    ln = |                                                (1)
         \  l1 + (tn - t1)     for n > 1

where ln denotes the location within the playout buffer for packet n within a talkspurt, tn the timestamp of packet n within a talkspurt, p the current playout location (the read pointer) and dmax the current estimated playout delay, that is, the estimated maximum delay variation. All quantities are measured in appropriate units (time, samples, or bytes). Addition is performed modulo the buffer size. The role of the synchronization bit for packet video remains to be defined. It does not have to bear any relationship to the content, e.g., the frame structure of a packet video source, as it merely indicates where delay can be varied without affecting perceived quality. The disadvantage of this scheme is that it is impossible for the receiver to get an accurate count of the number of packets that it should have received. While gaps within a talkspurt give some indication of packet loss, we cannot tell what part of the tail of a talkspurt has been transmitted. (Example: consider the talkspurts with timestamps 100, 101, 102, 110, 111, where the packets with timestamps 100 and 110 have the synchronization bit set. At the receiver, we have no way of knowing whether we were supposed to have received two talkspurts with a total of five packets, or two or more talkspurts with up to 12 packets.)
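Equation (1) can be sketched as follows (an illustrative helper; all quantities are in the same units, and addition is performed modulo the buffer size as stated in the text):

```python
def playout_location(n, t_n, t_1, l_1, p, d_max, buf_size):
    """Insertion rule (1): the first packet of a talkspurt is placed a full
    playout delay d_max ahead of the read pointer p; subsequent packets are
    placed relative to the first by their timestamp offset."""
    if n == 1:
        return (p + d_max) % buf_size
    return (l_1 + (t_n - t_1)) % buf_size
```

Because every later packet is placed relative to the first one, the playout delay stays fixed for the whole talkspurt and is re-estimated only at synchronization points.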
We can overcome this difficulty by enhancing RTCP as discussed in Section 3.11.

synchronization bit/sequence number within talkspurt: G.764 implements this method. The sequence number zero is reserved for the first packet of a talkspurt, while sequence numbers 1 through 15 are used for the remaining packets within the talkspurt, wrapping around from 15 to 1, if necessary. This is equivalent to the synchronization bit described earlier. A sequence number gap also triggers a new talkspurt. The scheme is designed for networks that cannot reorder packets. With reordering, packets may easily be played out in the wrong order. Consider, for example, packets with sequence numbers 0, 1, and 2. If the packets arrive in the order 1, 2, 0, the receiver interprets this as two talkspurts and plays the packets in the order received. From the example, we can generalize that sequence numbers that number packets within a talkspurt are not suitable for networks that can reorder packets if used without timestamps.

G.764 also features a delay accumulator field, into which each node adds the queueing and processing delay accumulated at that node. A one-byte field is used to encode delays between 0 and 200 ms with a resolution of 1 ms. The resolution of 1 ms suffices since the delay estimate affects only the placement of the beginning of a talkspurt. Note that the synchronization mechanism does not depend on this delay value. The delay value does, however, allow the application to gauge how congested the underlying network is. With a delay estimate, equation (1) changes so that

    l1 = p + dmax - d1

where d1 is the variable delay accumulated by the first packet of the talkspurt. The end-to-end delay then becomes the maximum variable delay plus the fixed delay, rather than the sum of the estimated maximum variable delay, the fixed delay and the variable delay experienced by the first packet in the talkspurt. Thus, the end-to-end delay is lower without affecting the late loss probability. The delay accumulator could be used for any of the synchronization schemes described here.
Despite this benefit, its use within the Internet appears impossible, as we cannot expect routers to update a field in an application layer protocol like RTP.

3.5.2 End-of-talkspurt indication

An end-of-talkspurt indication is useful to distinguish silence from lost packets. The receiver would want to replace silence by an appropriate background noise level to avoid the ``noise-pumping'' associated with silence detection. On the other hand, missing packets should be reconstructed from previous packets. If the silence detector makes use of hangover, the transmitter can easily set the end-of-talkspurt indicator bit in the last hangover packet. If talkspurts follow each other immediately, the end-of-talkspurt indicator has no effect except in the case where the first packet of a talkspurt is lost. In that case, the indicator would erroneously trigger noise fill instead of loss recovery. The end-of-talkspurt indicator is implemented in G.764 as a ``more'' bit which is set to one for all but the last packet within a talkspurt.

3.5.3 Recommendation

Given the ease of cross-media synchronization and the media independence (except for the sub-frame aspect mentioned), the use of 32-bit 16/16 timestamps representing the middle part of the NTP timestamp is suggested. Generally, a real-time based timestamp appears to be preferable to a sample-based one, but it may not be realizable on some current operating systems. Inter-media synchronization has to await mechanisms that can accurately determine when a particular sample was actually received by the A/D converter. Given the lower overhead and the ease of playout reconstruction, a synchronization bit appears preferable to the sequence number/timestamp combination. Since sequence numbers are useful for cases where packets do not carry meaningful timing information and also ease loss detection, they should be provided for, space permitting.
3.6 Segmentation and Reassembly

For high-bandwidth video, a single frame may not fit into the maximum transmission unit (MTU). Thus, some form of frame sequence number is needed. If possible, the same sequence number should be used for synchronization and fragmentation. Several possibilities suggest themselves:

overload timestamp: No sequence number is used. Within a frame, the timestamp has no meaning. Since it is used for synchronization only when the synchronization bit is set, the other timestamps can just increase by one for each packet. However, as soon as the first frame gets lost or reordered, determining positions and timing becomes difficult or impossible.

continuous: The sequence number is incremented without regard to frame boundaries. If a frame consists of a variable number of packets, it may not be clear what position the packet occupies within the frame if packets are lost or reordered. Continuous sequence numbers make it possible to determine if all packets for a particular frame have arrived, but only after the first packet of the next frame, distinguished by a new timestamp, has arrived.

within frame: Naturally, this approach has properties complementary to the first.

continuous with first-packet option: Packets use a continuous sequence number plus an option in every packet indicating the initial sequence number within the playout unit(6). Carrying both a continuous and a packet-within-frame count achieves the same effect.

continuous with last-packet option: Packets carry a continuous sequence number plus an option in every packet indicating the last sequence number within the playout unit. This has the advantage that the receiver can readily detect when the last packet for a playout unit has been received. The transmitter may not know, however, at the beginning of a playout unit how many packets it will comprise.
Also, the position within the playout unit is more difficult to determine if the initial packet is lost.

It could be argued that encoding-specific location information should be contained within the media part, as it will likely vary in format and use from one medium to the next.

------------------------------
6. suggested by Steve Casner

3.7 Source Identification

3.7.1 Gateways, Reflectors and End Systems

It is necessary to be able to identify the origin of the real-time data in terms meaningful to the application. First, this is required to demultiplex sites (or sources) within the same conference. Secondly, it allows an indication of the currently active source. Currently, NVP makes no explicit provisions for this, assuming that the network source address can be used. This may fail if intermediate agents intervene between the media source and final destination. Consider the example in Fig. 1. An RTP-level gateway is defined as an entity that transforms either the RTP header or the RTP media data or both. Such a gateway could for example merge two successive packets for increased transport efficiency or, probably the most common case, translate media encodings for each stream, say from PCM to LPC (called transcoding). A synchronizing gateway is defined here as a gateway that recreates a synchronous media stream, possibly after mixing several sources. An application that mixes all incoming streams for a particular conference, recreates a synchronous audio stream and then forwards it to a set of receivers is an example of a synchronizing gateway. A synchronizing gateway could be built from two end system applications, with the first application feeding the media output to the media input of the second application and vice versa. In figure 1, the gateways are used to translate audio encodings, from PCM and ADPCM to LPC. The gateway could be either synchronizing or not.
Note that a resynchronizing gateway is only necessary if audio packets depend on their predecessors and thus cannot be transcoded independently. It may be advantageous if the packetization interval can be increased. Also, for connections that are barely able to handle one active source at a time, mixing at the gateway avoids excessive queueing delays when several sources are active at the same time. A synchronizing gateway has the disadvantage that it always increases the end-to-end delay.

We define reflectors as transport-level entities that translate between transport protocols, but leave the RTP protocol unit untouched. In the figure, the reflector connects a multicast group to a group of hosts that are not multicast capable by performing transport-level replication. We define an end system as an entity that receives and generates media content, but does not forward it.

We define three types of sources: the media source is the actual origin of the media, e.g., the talker in an audiocast; a synchronization source is the combination of several media sources with its own timing; the network source is the network-level origin as seen by the end system receiving the media. The end system has to synchronize its playout with the synchronization source, indicate the active party according to the media source and return media to the network source. If an end system receives media through a resynchronizing gateway, the end system will see the gateway as the network and synchronization source, but the media sources should not be affected. The reflector does not affect the media or synchronization sources, but the reflector becomes the network source. (Note that having the reflector change the IP source address is not possible since the end systems need to be able to return their media to the reflector.) vat audio packets include a variable-length list of at most 64 4-byte identifiers containing all media sources of the packet.
However, there is no convenient way to distinguish the synchronization source from the network source. The end system needs to be able to distinguish synchronization sources because jitter computation and playout delay differ for each synchronization source.

   /-------\        +------+
   |       | ADPCM  |      |--\  LPC
   | group |<------>|  GW  |   \            /------ end system
   |       |        |      |    \          /
   \-------/        +------+     reflector >------- end system
   /-------\        +------+    /          \
   |       |  PCM   |      |   /            \------ end system
   | group |<------>|  GW  |--/  LPC
   |       |        |      |
   \-------/        +------+

   <---> multicast

             Figure 1: Gateway topology

Rather than having the gateway (which may be unaware of the existence of a reflector downstream) insert a synchronization source identifier or having the reflector know about the internal structure of RTP packets, the current ad-hoc encapsulation solution used by Nevot may be sufficient: the reflector simply prefixes the true network address (and port?) of the last source (either the gateway or media source, i.e., the synchronization source) to the RTP packet. Thus, each end system and gateway has to be aware whether it is being served by a reflector. Also, multiple concatenated reflectors are difficult to handle.

3.7.2 Address Format Issues

The limitation to four bytes of addressing information may not be desirable for a number of reasons. Currently, it is used to hold an IP address. This works as long as four bytes are sufficient to hold an identifier that is unique throughout the conference and as long as there is only one media source per IP address.
The latter assumption tends to be true for many current workstations, but it is easy to imagine scenarios where it might not be, e.g., a system could hold a number of audio cards, could have several audio channels (Silicon Graphics systems, for example) or could serve as a multi-line telephone interface.(7) The combination of IP address and source port can identify multiple sources per site if each media source uses a different network source port. It does not seem appropriate to force applications to allocate ports just to distinguish sources. In the PBX example, a single output port would appear to be the appropriate method for sending all incoming calls across the network. Given the discussion of longer address formats at least in the longer term, it seems appropriate to consider allowing for variable-length identifiers. Ideally, the identifier would identify the agent, not a computer or network interface.(8) A currently viable implementation is the concatenation of the IP address and some locally unique number. The meaning of the local discriminator is opaque to the outside world; it appears to be generally easier to have a local unique id service than a distributed version thereof. Possibilities for the local discriminator include the numeric process identifier (plus some distinguishing information within the application), the network source port number or a numeric user identifier. For efficiency in the common case of one source per workstation, the convention (used in vat) of using the network source address, possibly combined with the user id or source port, as media and synchronization source should be maintained.

------------------------------
7. If we are willing to forego the identification with a site, we could have a multiple-audio-channel site pick unused IP addresses from the local network and associate them with the second and following audio ports.
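The concatenation of IP address and local discriminator described above can be sketched as follows (an assumed layout for illustration, not a wire format defined by this draft):

```python
import socket
import struct

def source_id(ip, discriminator):
    """Form a source identifier from a 4-byte IPv4 address followed by a
    4-byte locally unique discriminator (e.g., a process id or user id)."""
    return socket.inet_aton(ip) + struct.pack("!I", discriminator)
```

Since the discriminator is opaque to other participants, its internal structure can be chosen freely by each host, as long as it stays unique within that host for the duration of the conference.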
3.8 Energy Indication

G.764 contains a 4-bit noise energy field, which encodes the white noise energy to be played by the receiver in the silences between talkspurts. Playing silence periods as white noise reduces the noise-pumping where the background noise audible during the talkspurt is audibly absent at the receiver during silence periods. Substituting white noise for silence periods at the receiver is not recommended for multi-party conferences, as the summed background noise from all silent parties would be distracting. Determining the proper noise level appears to be difficult. It is suggested that the receiver simply take the energy of the last packet received before the beginning of a silence period as an indication of the background noise. With this mechanism, an explicit indication in the packet header is not required.

3.9 Error Control

It remains to be decided whether the header, the whole packet or neither should be protected by checksums. NVP protects its header only, while G.764 has a single 16-bit check sequence covering both the datalink and packet voice header. However, if UDP is used as the transport protocol, a checksum over the whole packet is already computed by the receiver. (Checksumming for UDP can typically be disabled by the sending or receiving host.) ST-II does not compute checksums for either header or data. Many data link protocols already discard packets with bit errors, so that packets are rarely rejected due to higher-layer checksums.

------------------------------
8. In the United States, a one-way encryption function applied to the social security number would serve to identify human agents without compromising the SSN itself, given that the likelihood of identical SSNs is sufficiently small. The use of a telephone number may be less controversial and is applicable world-wide, but may require some local coordination if numbers are shared.
Bit errors within the data part are probably easier to tolerate than a lost packet, particularly since some media encoding formats may provide built-in error correction. The impact of bit errors within the header can vary; for example, errors within the timestamp may cause the audio packet to be played out at the wrong time, probably much more noticeable than discarding the packet. Other noticeable effects are caused by a wrong conference ID or false encoding (if present). If a separate checksum is desired for the cases where the underlying protocols do not already provide one, it should be optional. Once optional, it would be easy to define several checksum options, covering just the header, the header plus a certain part of the body or the whole packet. A checksum can also be used to detect whether the receiver has the correct decryption key, avoiding noise or (worse) denial-of-service attacks. For that application, the checksum should be computed across the whole packet, before encrypting the content. Alternatively, a well-known signature could be added to the packet and included in the encryption, as long as known plaintext does not weaken the encryption security. Recommendation: optional for header; if not used, 4-byte signature in data. 3.10 Security 3.10.1 Encryption Only encryption can provide privacy as long as intruders can monitor the channel. It is desirable to specify an encryption algorithm and provide implementations without export restrictions. DES is widely available outside the United States and could easily be added even to binary-only applications by dynamic linking. We have the choice of either encrypting both the header and data or only the data. Encrypting the header denies the intruder knowledge about some conference details (for example, who the participants are, although this is only true as long as the UDP source address does not already reveal that information). 
It also allows some heuristic detection of key mismatches, as the version identifier, timestamp and other header information are somewhat predictable. However, header encryption makes packet traces and debugging by external programs difficult.

Public-key cryptography does not work for true multicast systems, since the public encryption key differs for every recipient, but it may be appropriate for two-party conversations or application-level multicast. In that case, mechanisms similar to privacy-enhanced mail will probably be appropriate. Key distribution for non-public-key encryption is beyond the scope of this recommendation. For one-way applications, it may be desirable to prohibit listeners from interrupting the broadcast. (After all, since live lectures on campus get disrupted fairly often, there is reason to fear that a sufficiently controversial lecture carried on the Internet would suffer a similar fate.) Again, asymmetric encryption can be used. Here, the decryption key is made available to all receivers, while the encryption key is known only to the legitimate sender. Current public-key algorithms are probably too computationally intensive for all but low-bit-rate voice. In most cases, filtering based on sources will be sufficient.

3.10.2 Authentication

The usual message digest methods are applicable if only the integrity of the message is to be protected against spoofing.

3.11 Quality of Service Control

Because real-time services cannot afford retransmissions, they are immediately affected by packet loss and delays. For debugging and monitoring purposes, it is useful to know exactly where and why losses occur. Losses occur either within the network or because of excessive delay within the application. To determine the fraction of losses and the amount of network loss, knowledge of the number of frames transmitted is required.
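As a sketch of how such counts translate into a loss estimate (illustrative Python; not part of the protocol), the receiver compares its own count against the transmitted count reported by the sender:

```python
def loss_fraction(sent: int, received: int) -> float:
    """Fraction of packets lost, given the sender's transmitted count
    (e.g., learned through a control message) and the receiver's own
    count of packets that arrived in time to be played out."""
    if sent <= 0:
        return 0.0
    # Reordering around a measurement point can briefly make the
    # received count exceed the reported sent count; clamp at zero.
    lost = max(sent - received, 0)
    return lost / sent
```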
A packet sequence number with sufficient range provides the most reliable and easiest-to-implement method of gauging packet loss. If a sequence number is not available, it is difficult or impossible for the receiver to get an accurate count of the packets transmitted. Thus, the following RTCP service is suggested for that case. An RTCP message of type PC (packet count) contains two 32-bit integers, the first containing the timestamp when the measurement was taken, the second the number of transmitted samples, bytes, packets, or the amount of audio/video measured in seconds, expressed as a 16/16 timestamp. To make it easier for the receiver to use that information, the sample should be taken at a synchronization point, indicated by the synchronization bit in the data packet (see Section 3.5.1). Since this field is intended to measure network packet loss, a packet or byte count would be the simplest to maintain, as the meaning of a sample depends on the packet content, for example the number of channels, the encoding, whether it is audio or video, and so on.

The receiver simply stores the number of received samples at each synchronization point and then, after receiving the PC packet, can determine the fraction of packets lost so far. Packet reordering may introduce a slight inaccuracy if a packet sent before the synchronization point arrives afterwards. Given that there typically is a gap between that last packet and the synchronization point, this occurrence should be sufficiently unlikely as to leave the loss measurement accurate enough for QOS monitoring. This method avoids the cumulative errors inherent in estimates based purely on timestamps.

4 Conference Control Protocol

Currently, only conference control functions used for loose conferences (open admission, no explicit conference set-up) have been considered in depth.
Support for the following functionality needs to be specified:

 o authentication

 o floor control, token passing

 o invitations, calls

 o discovery of conferences and resources (directory service)

 o media, encoding and quality-of-service negotiation

 o voting

 o conference scheduling

The functional specification of a conference control protocol is beyond the scope of this draft.

5 Packet Format

Given the above technical justifications, the following packet formats are proposed.

5.1 Data

The data packet header format is shown in Figure 2. The optional 16-bit framing field and the optional 32-bit IP address designating the network source are not shown. All integer fields are in network byte order (most significant byte first). The content of the fields is defined as follows:

protocol version: two-bit version identifier. The initial version number is one. The value of zero is reserved for the current vat protocol.

sync (S): synchronization bit, described in Section 3.5.1.

media: media encoding. The five bits form an index into a table of encodings defined out-of-band. If no mapping has been defined, a standard mapping to be specified by the IANA is used. The value of zero is reserved and indicates that the encoding is carried as an option of type MEDIA. The value of one is reserved and indicates that the encoding is specified in RTCP packets or the conference control protocol. If a packet with a media field value of one arrives and no encoding is known from the conference control protocol, the receiver should defer playing these packets until a control packet has been received. If the packet does not contain a MEDIA option, the last defined encoding is used.

option length: number of 32-bit words contained within the options immediately following the header.

sequence number: 16-bit sequence number counting packets.

timestamp: timestamp, reflecting real time.
The timestamp consists of the middle 32 bits of an NTP timestamp.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver|S|  media  | option length |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      timestamp (seconds)      |     timestamp (fraction)      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                Figure 2: RTP data packet format

The packet header is followed by options, if any, and the media data. Optional fields are summarized in Table 4. Each packet may contain any number of options; unless otherwise noted, each option may appear only once per packet. Each option consists of a one-byte option type designation, followed by a one-byte length field denoting the total number of 32-bit words comprising the option, followed by any option-specific data. Options are aligned to the natural length of the field, i.e., 16-bit words are aligned on even addresses, 32-bit words are aligned at addresses divisible by four, etc. Options unknown to the application are to be ignored. The MEDIA option, if present, must precede all other options whose interpretation depends on the current encoding. Currently, no such options are defined.

type   description
-----  ------------------------------------------------

MSRC   Globally unique media source identifier. A packet may contain multiple options of this type, indicating all contributors. A source is identified by a globally unique six-byte string: the concatenation of a two-byte numeric user id unique within the system, followed by a four-byte Internet address(9). If missing, the network source is considered the media source.

SSRC   Globally unique synchronization source identifier. The format is the same as for the MSRC option. If missing, the network source is considered the synchronization source.
MEDIA  Media encoding identification, as discussed in Section 3.4. The first byte designates the encoding, with values of 128 through 255 reserved for experimental encodings. Values of 0 through 127 are assigned by the IANA. Encoding-specific parameters follow. The parameter string is padded with zeros until the option has a length divisible by four. For audio encodings, a single byte contains a two-bit channel count in the most significant bits and a six-bit index into an IANA-defined table of sampling frequencies in the least significant bits. An index value of zero designates the natural sampling frequency defined for each encoding.

ENERG  Energy indication. The length and interpretation of this field are media-dependent and specified for each encoding. The ENERG field must follow the MEDIA field, if present.

BOP    (beginning of playout unit) 16-bit sequence number designating the first packet within the current playout unit.

                  Table 4: Optional fields

5.2 Control Packets

The scope of RTCP is meant to be limited to a single medium, conveying minimal out-of-band state information during a conference. Thus, any means of providing reliability are beyond its scope. A version field is not needed since new control message types can be defined readily. Control packets are sent periodically to the same multicast group as data packets, using the same time-to-live value. The period should be varied randomly to avoid synchronization of all sources. The period determines how long a new receiver has to wait, in the worst case, until it can identify a source. The control packets defined here extend the functionality found in vat session packets. Control packets consist of one or more items using the same format and alignment as options within the data packet.
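The header and option/item layouts described above can be sketched as follows (illustrative Python, not part of the specification; the assumption that the leftmost field of Figure 2 occupies the most significant bits of the first byte follows usual network-diagram convention):

```python
import struct

def pack_header(version, sync, media, opt_words, seq, ts_sec, ts_frac):
    """Fixed data packet header of Figure 2: Ver(2 bits), S(1), media(5),
    option length, sequence number, 16/16 timestamp, network byte order."""
    first = ((version & 0x3) << 6) | ((sync & 0x1) << 5) | (media & 0x1f)
    return struct.pack("!BBHHH", first, opt_words & 0xff,
                       seq & 0xffff, ts_sec & 0xffff, ts_frac & 0xffff)

def pack_option(opt_type, payload):
    """One option/item: type byte, length byte counting the 32-bit words
    comprising the whole option, payload zero-padded to a multiple of
    four bytes."""
    total = 2 + len(payload)
    padded = total + (-total % 4)     # round up to a 32-bit boundary
    return (bytes([opt_type & 0xff, padded // 4])
            + payload + b"\0" * (padded - total))
```

A complete data packet would then be `pack_header(...)` followed by the packed options and the media data.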
Non-overlapping type numbers for data packet options and control message items are to be assigned, so that control information could be carried in data packets if so desired. The packet format is shown in Figure 3, while the item types are defined in Table 5. Padding is used to align fields to multiples of four bytes. The value used for padding is undefined.

A Port Assignment

Since it is anticipated that UDP and similar port-oriented protocols will play a major role in carrying RTP traffic, the issue of port assignment needs to be addressed. The way ports are assigned mainly affects how applications can extract the packets destined for them. For each medium, there also needs to be a mechanism for distinguishing data from control packets.

For unicast UDP, only the port number is available for demultiplexing. Thus, each medium will need a separate port number pair unless a separate demultiplexing agent is used. However, for one-to-one connections, dynamically negotiating a port number is easy. If several UDP streams are used to provide multicast, the port number issue becomes more thorny. For connection-oriented protocols like ST-II or TCP, only packets for a particular connection reach the application. For UDP multicast, an application can elect to receive only packets with a particular port number and multicast address by binding to the appropriate multicast address. Thus, for UDP multicast, there is no need to distinguish media by port numbers, as each medium is assumed to have its own designated multicast group. Any dynamic port allocation mechanism would fail for large, dynamic multicast groups, but might be appropriate for small conferences and two-party conversations.

Data and control packets for a single medium can either share a single port or use two different port numbers. (Currently, two adjacent port numbers are used.) A single port for data and control simplifies the receiver code and conserves port numbers.
It requires some other means of identifying control packets, for example a special media code, and does not allow the sharing of a single control port by several applications.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  type=ENERG   |   length=1    |         energy level          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=MSRC   |   length=2    |            user id            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  IP address of media source                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=SSRC   |   length=2    |            user id            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             IP address of synchronization source              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  type=MEDIA   |   length=2    |   encoding    |ch#|sampling f.|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 encoding-specific parameters                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=BOP    |   length=1    |  first seq.# in playout unit  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Figure 3: RTCP packet format
type   description
-----  ------------------------------------------------

ID     The media or synchronization source identifier, using the same 6-byte format as the MSRC and SSRC options. This identifier applies to all following items until the next ID item. This results in more compact coding when application gateways are used and allows aggregation of several sources into one control message.

ALIAS  A variable-length string padded with zeros so that the total length of the item, including the type and length bytes, is a multiple of four bytes. The content of the field describes the media source identified by the most recent ID item, for example by giving the name and affiliation of the talker or the call letters of the radio station being rebroadcast. The content is not specified or authenticated. The text is encoded as 7-bit US-ASCII, values 32 to 127 (decimal). The escape mechanism for character sets other than US-ASCII remains to be defined (ISO 2022?).

DESC   Media content description, with the same format as ALIAS. The field describes the current media content. Example applications include the session title for a conference distribution, or the current program title for radio or television redistribution through packet networks.

BYE    The site specified by the most recent ID item requests to be dropped from the conference. No further data; padded to a 32-bit word length.

PC     16 bits of padding are followed by a 32-bit 16/16 timestamp (same format as the synchronization timestamp) and a 32-bit packet count. The item specifies the number of packets transmitted by the sender of this RTCP message up to the time specified.

TIME   16 bits of padding are followed by wallclock time and media clock, both expressed as 16/16 timestamps.

MEDIA  Media description (see Table 4).

              Table 5: The RTCP message types
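A receiver walking an RTCP packet iterates over items in the type/length format above; a minimal sketch (illustrative Python, not part of the specification):

```python
def parse_items(buf: bytes):
    """Split a control packet into (type, payload) items.  Each item
    begins with a one-byte type and a one-byte length counting the
    32-bit words comprising the whole item (including these two bytes)."""
    items = []
    off = 0
    while off + 2 <= len(buf):
        itype, words = buf[off], buf[off + 1]
        total = words * 4
        if words == 0 or off + total > len(buf):
            break                # malformed item: stop parsing
        items.append((itype, buf[off + 2:off + total]))
        off += total
    return items
```

Items with unknown type values would simply be skipped by the caller, mirroring the rule that unknown options are ignored.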
Using a single RTCP stream for several media may be advantageous to avoid duplicating, for example, the same identification information for voice, video and whiteboard streams. This works only if there is one multicast group that all members of a conference subscribe to. Given the relatively low frequency of control messages, the coordination effort between applications and the necessity to designate control messages for a particular medium are probably reason enough to have each application send control messages to the same multicast group as the data. In conclusion, for multicast UDP, two assigned port numbers, one for data and one for control, seem to offer the most flexibility.

B Multicast Address Allocation

A fixed allocation of network multicast addresses to conferences is clearly not feasible, since the lifetime of conferences is unknown, the potential number of conferences is rather large and the available number space is limited to about 2^28 addresses, of which 2^16 have been assigned to conferences. Dynamic allocation of addresses without the intervention of some centralized clearing-house mechanism appears to be difficult. One approach would be akin to carrier sense multiple access: a conference originator would listen on a randomly selected multicast address using the session port (it is left as an exercise to the reader to imagine what happens if a data port is used). Within a small multiple of the session announcement interval (with vat, this interval averages six seconds), we would have some indication of whether the address is in use. This technique may fail for a number of reasons. First, collisions are possible if the same multicast address is checked nearly simultaneously, though this is unlikely as long as the number space is only sparsely utilized.
More seriously, it is quite possible that multicast islands using the same multicast group are unaware of each other because they are isolated by time-to-live restrictions or temporary network interruptions. It is clearly undesirable to be forced to renegotiate a new multicast address in the middle of a conference because time-to-live values or network connectivity have changed. It appears to the author that since multicasting takes place at the IP level, we would have to check all potential ports to avoid drawing multicast traffic with the same group but different destination port towards us. Some IP-level mechanism would have to be added to the kernel to avoid having to scan all ports. A probe packet sent with maximum time-to-live to the desired address would avoid missing time-to-live-isolated islands and would detect temporarily idle multicast groups, but would impose a rather severe load on the network, without solving temporary network partitions. Probe packets and responses could also get lost. Using probe packets also requires an agreement that all potential users of the range of multicast addresses would indeed respond to a probe packet. Using the conference identifier at the RTP level to detect collisions may have severe performance consequences for both the network and the receiving host if the conference sharing the same multicast group happens to send high-bandwidth data.

One solution would be to provide a hierarchical allocation of addresses. Here, the originator of a conference asks the nearest address provider for an available address. The provider in turn asks the next level up (for example, the regional network) or a peer if it has temporarily run out of addresses. The conference originator would be responsible for returning the address after use.
The return of addresses after use raises the issue of what happens if either the requesting agent or the issuer of the address crashes. A timeout mechanism is probably most robust. Addresses could be issued for a certain number of hours. If the original requester renews the request before the expiration of the timeout period, it is guaranteed to have the request granted. With that policy, requester or issuer crashes can be handled gracefully under most circumstances. It remains to be decided what a conference originator is supposed to do if an address renewal request fails because the address provider has crashed or connectivity has been lost. It is imaginable that each site would pay an access fee for a block of addresses, similar to the access-speed dependent fee charged for network connectivity within the Internet. This would provide local incentives for each administrative domain (AD) to recoup unused addresses. Trading of smaller address blocks between friendly ADs could accommodate peak demands or clearing-house failures, similar to the mutual support agreements between electrical utilities. For increased reliability, each AD could offer multiple clearing-houses, just as it typically maintains several name servers. As an extension, it may be desirable to distinguish multicast addresses with different reach. A local address would be given out with the restriction of a maximum time-to-live value and could thus be reused at an AD sufficiently removed, akin to the combination of cell reuse and power limitation in cellular telephony. Given that many conferences will be local or regional (e.g., broadcasting classes to nearby campuses of the same university or a regional group of universities, or an electronic town meeting), this should allow significant reuse of addresses. Reuse of addresses requires careful engineering of thresholds and would probably only be useful for very small time-to-live values that restrict reach to a single local area network. 
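The lease-with-renewal policy above can be sketched as follows (illustrative Python; the class and method names are hypothetical and not part of the draft):

```python
import time

class AddressLease:
    """Toy lease-based multicast address provider: addresses are issued
    for a fixed period and reclaimed on expiry, so a crash of either
    the requester or the issuer heals itself after one timeout."""

    def __init__(self, pool, lease_seconds=3600):
        self.free = set(pool)
        self.leases = {}                 # address -> (holder, expiry)
        self.lease_seconds = lease_seconds

    def _expire(self, now):
        for addr, (_, exp) in list(self.leases.items()):
            if exp <= now:               # holder never renewed: reclaim
                del self.leases[addr]
                self.free.add(addr)

    def request(self, holder, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        if not self.free:
            return None                  # would escalate to the next level up
        addr = self.free.pop()
        self.leases[addr] = (holder, now + self.lease_seconds)
        return addr

    def renew(self, holder, addr, now=None):
        now = time.time() if now is None else now
        lease = self.leases.get(addr)
        if lease and lease[0] == holder and lease[1] > now:
            # Renewal before expiry is always granted, per the policy above.
            self.leases[addr] = (holder, now + self.lease_seconds)
            return True
        return False
```

A crashed requester simply stops renewing, and its address returns to the pool at the next expiry sweep.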
The proposed allocation mechanism has no single point of failure, scales well and conserves the addressing resources by providing appropriate incentives, combined with local control. It requires sufficient address space to supply the hierarchy.(10) The address allocation may or may not be handled by the same authority that provides conference naming and discovery services.

------------------------------
10. The ideas presented here are compatible with the more general proposals contained in ``Remote Conferencing Architecture'' by Yee-Hsiang Chang and Jon Whaley.

C Glossary

The glossary below briefly defines the acronyms used within the text. Further definitions can be found in the Internet draft draft-ietf-userglos-glossary-00.txt, available for anonymous ftp from nnsc.nsf.net and other sites. Some of the general Internet definitions below are copied from that glossary.

16/16 timestamp: A 32-bit integer timestamp consisting of a 16-bit field containing the number of seconds followed by a 16-bit field containing the binary fraction of a second. This timestamp can measure about 18.2 hours with a resolution of approximately 15 microseconds.

ADPCM: adaptive differential pulse code modulation. Rather than transmitting ! PCM samples directly, the difference between the estimate of the next sample and the actual sample is transmitted. This difference is usually small and can thus be encoded in fewer bits than the sample itself. The ! CCITT recommendations G.721, G.723, G.726 and G.727 describe ADPCM encodings.

CCITT: Comite Consultatif International Telegraphique et Telephonique. This organization is part of the United Nations International Telecommunications Union (ITU) and is responsible for making technical recommendations about telephone and data communications systems. X.25 is an example of a CCITT recommendation. Every four years CCITT holds plenary sessions where it adopts new recommendations. Recommendations are known by the color of the cover of the book they are contained in.

CELP: code-excited linear prediction; audio encoding method for low-bit-rate codecs.

CD: compact disc.

codec: short for coder/decoder; device or software that ! encodes and decodes audio or video information.

companding: reducing the dynamic range of audio or video by a non-linear transformation of the sample values. The best-known methods for audio are u-law, used in North America, and A-law, used in Europe and Asia. ! G.711 [10]

DAT: digital audio tape.

encoding: transformation of the media content for transmission, usually to save bandwidth, but also to decrease the effect of transmission errors. Well-known encodings are G.711 (u-law PCM) and ADPCM for audio, JPEG and MPEG for video. ! encryption

encryption: transformation of the media content to ensure that only the intended recipients can make use of the information. ! encoding

end system: host where conference participants are located. RTP packets received by an end system are played out, but not forwarded to other hosts (in a manner visible to RTP).

frame: unit of information. Commonly used for video to refer to a single picture. For audio, it refers to data that forms an encoding unit. For example, an LPC frame consists of the coefficients necessary to generate a specific number of audio samples.

G.711: ! CCITT recommendation for ! PCM audio encoding at 64 kb/s using u-law or A-law companding.

G.764: ! CCITT recommendation for packet voice; specifies both an ! HDLC-like data link layer and a network layer. In the draft stage, this standard was referred to as G.PVNP. The standard is primarily geared towards digital circuit multiplication equipment used by telephone companies to carry more voice calls on transoceanic links.
G.PVNP: designation of CCITT recommendation ! G.764 while in draft status.

GSM: Groupe Special Mobile. In general, the designation for the European mobile telephony standard; in particular, often used to describe the 8 kb/s audio coding it uses.

H.261: ! CCITT recommendation for the compression of motion video at rates of p x 64 kb/s. Originally intended for narrowband ! ISDN.

hangover: audio data transmitted after the silence detector indicates that no audio data is present. Hangover ensures that the ends of words, important for comprehension, are transmitted even though they are often of low energy.

HDLC: high-level data link control; standard data link layer protocol (closely related to LAPD and SDLC).

ICMP: Internet Control Message Protocol; an extension to the Internet Protocol that allows for the generation of error messages, test packets and informational messages related to ! IP.

in-band: signaling information is carried together (in the same channel or packet) with the actual data. ! out-of-band

IP: internet protocol; the Internet Protocol, defined in RFC 791, is the network layer for the TCP/IP protocol suite. It is a connectionless, best-effort packet switching protocol [11].

IP address: four-byte binary host interface identifier used by ! IP for addressing. An IP address consists of a network portion and a host portion. RTP treats IP addresses as globally unique, opaque identifiers.

IPv4: current version (4) of ! IP.

ISDN: integrated services digital network; refers to an end-to-end circuit-switched digital network intended to replace the current telephone network. ISDN offers circuit-switched bandwidth in multiples of 64 kb/s (B or bearer channels), plus a 16 kb/s packet-switched data (D) channel.

JPEG: joint photographic experts group. Designation of a variable-rate compression algorithm using discrete cosine transforms for still-frame color images.

LPC: linear predictive coder.
Audio encoding method that models speech as the parameters of a linear filter; used for very low bit rate codecs.

loosely controlled conference: participants can join and leave the conference without connection establishment or notifying a conference moderator. The identity of conference participants may or may not be known to other participants. See also: tightly controlled conference.

MPEG: motion picture experts group. Designates a variable-rate compression algorithm for full-motion video at low bit rates; uses both intraframe and interframe coding.

media source: entity (user and host) that produced the media content. It is the entity shown as the active participant by the application.

MTU: maximum transmission unit; the largest frame length which may be sent on a physical medium.

Nevot: network voice terminal; application written by the author.

network source: entity denoted by the address and port number from which the ! end system receives an RTP packet and to which the end system sends any RTP packets for that conference in return.

NVP: network voice protocol; original packet format used in early packet voice experiments; defined in RFC 741 [3].

OSI: Open Systems Interconnection; a suite of protocols, designed by ISO committees, to be the international standard computer network architecture.

out-of-band: signaling and control information is carried in a separate channel or separate packets from the actual data. For example, ICMP carries control information out-of-band, that is, as separate packets, for IP, but both ICMP and IP usually use the same communication channel (in band). ! in-band

PCM: pulse-code modulation; speech coding where speech is represented by a given number of fixed-width samples per second. Often used for the coding employed in the telephone network: 8,000 eight-bit samples per second.
playout: delivery of the media content to the final consumer within the receiving host. For audio, this implies digital-to-analog conversion; for video, display on a screen.

PVP: packet video protocol; extension of ! NVP to video data [12].

SB: subband, as in subband codec; audio or video encoding that splits the frequency content of a signal into several bands and encodes each band separately, with the encoding fidelity matched to human perception for that particular frequency band.

RTCP: real-time control protocol; adjunct to ! RTP.

RTP: real-time transport protocol; discussed in this draft.

ST-II: stream protocol; connection-oriented, unreliable, non-sequenced, packet-oriented network and transport protocol with process demultiplexing and provisions for establishing flow parameters for resource control; defined in RFC 1190 [1].

TCP: transmission control protocol; an Internet standard transport layer protocol defined in RFC 793. It is connection-oriented and stream-oriented, as opposed to UDP [13].

TPDU: transport protocol data unit.

tightly controlled conference: participants can join the conference only after an invitation from a conference moderator. The identity of all conference participants is known to the moderator. ! loosely controlled conference

transcoder: device or application that translates between several encodings, for example between ! LPC and ! PCM.

UDP: user datagram protocol; unreliable, non-sequenced, connectionless transport protocol defined in RFC 768 [14].

vat: visual audio tool (voice terminal) written by Steve McCanne and Van Jacobson.

vt: voice terminal software written at the Information Sciences Institute.

VMTP: versatile message transaction protocol; defined in RFC 1045 [15].
D Address of Author

Henning Schulzrinne
AT&T Bell Laboratories
MH 2A244
600 Mountain Avenue
Murray Hill, NJ 07974
telephone: 908 582-2262
electronic mail: hgs@research.att.com

References

[1] S. Casner, C. Lynn, Jr., P. Park, K. Schroder, and C. Topolcic, ``Experimental internet stream protocol, version 2 (ST-II),'' Network Working Group Request for Comments RFC 1190, Information Sciences Institute, Oct. 1990.

[2] C. Topolcic, ``ST II,'' in First International Workshop on Network and Operating System Support for Digital Audio and Video, no. TR-90-062 in ICSI Technical Reports, (Berkeley, CA), 1990.

[3] D. Cohen, ``Specification for the network voice protocol (NVP),'' Network Working Group Request for Comments RFC 741, ISI, Jan. 1976.

[4] N. Borenstein and N. Freed, ``MIME (multipurpose internet mail extensions) mechanisms for specifying and describing the format of internet message bodies,'' Network Working Group Request for Comments RFC 1341, Bellcore, June 1992.

[5] J. G. Gruber and L. Strawczynski, ``Subjective effects of variable delay and speech clipping in dynamically managed voice systems,'' IEEE Transactions on Communications, vol. COM-33, pp. 801--808, Aug. 1985.

[6] N. S. Jayant, ``Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure,'' IEEE Transactions on Communications, vol. COM-29, pp. 101--109, Feb. 1981.

[7] D. Minoli, ``Optimal packet length for packet voice communication,'' IEEE Transactions on Communications, vol. COM-27, pp. 607--611, Mar. 1979.

[8] V. Jacobson, ``Compressing TCP/IP headers for low-speed serial links,'' Network Working Group Request for Comments RFC 1144, Lawrence Berkeley Laboratory, Feb. 1990.

[9] D. L. Mills, ``Network time protocol (version 2) --- specification and implementation,'' Network Working Group Request for Comments RFC 1119, University of Delaware, Sept. 1989.

[10] N. S.
Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice Hall, 1984.

[11] J. Postel, ``Internet protocol,'' Network Working Group Request for Comments RFC 791, Information Sciences Institute, Sept. 1981.

[12] R. Cole, ``PVP - a packet video protocol,'' W-Note 28, Information Sciences Institute, University of Southern California, Los Angeles, CA, Aug. 1981.

[13] J. B. Postel, ``DoD standard transmission control protocol,'' Network Working Group Request for Comments RFC 761, Information Sciences Institute, Jan. 1980.

[14] J. B. Postel, ``User datagram protocol,'' Network Working Group Request for Comments RFC 768, ISI, Aug. 1980.

[15] D. R. Cheriton, ``VMTP: Versatile Message Transaction Protocol specification,'' Network Working Group Request for Comments RFC 1045, SRI International, Menlo Park, CA, Feb. 1988.