Audio/Video Payload WG                                        T. Schierl
Internet Draft                                            Fraunhofer HHI
Intended status: Standards track                               S. Wenger
Expires: August 2012                                               Vidyo
                                                              Y.-K. Wang
                                                                Qualcomm
                                                        M. M. Hannuksela
                                                                   Nokia
                                                       February 27, 2012

          RTP Payload Format for High Efficiency Video Coding
                 draft-schierl-payload-rtp-h265-00.txt

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on August 27, 2012.

Copyright and License Notice

Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.

Wenger, et al          Expires August 27, 2012                  [Page 1]
Internet-Draft       RTP Payload Format for HEVC           February 2012

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents carefully,
as they describe your rights and restrictions with respect to this
document. Code Components extracted from this document must include
Simplified BSD License text as described in Section 4.e of the Trust
Legal Provisions and are provided without warranty as described in the
Simplified BSD License.

Abstract

This memo describes an RTP payload format for High Efficiency Video
Coding (HEVC) [HEVC], which is currently being developed by the Joint
Collaborative Team on Video Coding (JCT-VC). The RTP payload format
allows for packetization of one or more Network Abstraction Layer (NAL)
units in each RTP packet payload, as well as fragmentation of a NAL
unit into multiple RTP packets. Furthermore, it supports transmission
of an HEVC stream over a single as well as multiple RTP flows. The
payload format has wide applicability in videoconferencing, Internet
video streaming, and high bit-rate entertainment-quality video, among
others.

Table of Contents

Status of this Memo
Abstract
Table of Contents
1. Introduction
   1.1. The HEVC Codec
      1.1.1 Overview
      1.1.2 Parallel Processing Support
      1.1.3 Parameter Sets
      1.1.4 NAL Unit Header
   1.2. Overview of the Payload Format
2. Conventions
3. Definitions and Abbreviations
   3.1 Definitions
      3.1.1 Definitions from the HEVC Specification
      3.1.2 Definitions Specific to This Memo
   3.2 Abbreviations
4. RTP Payload Format
   4.1 RTP Header Usage
   4.2 NAL Unit Header Usage
   4.3 Payload Structures
   4.4 Transmission Modes
   4.5 Packetization Modes
   4.6 Decoding Order
   4.7 Aggregation Packets
      4.7.1 Single Time Aggregation Packet (STAP)
   4.8 Fragmentation Units (FUs)
5. Packetization Rules
   5.1 Common Packetization Rules
   5.2 Non-Interleaved mode
   5.3 Interleaved mode
6. De-Packetization Process
   6.1 Non-Interleaved Mode
   6.2 Interleaved Mode
      6.2.1 Size of the De-interleaving Buffer
      6.2.2 De-interleaving Process
   6.3 Additional De-Packetization Guidelines
7. Payload Format Parameters
   7.1 Media Type Registration
   7.2 SDP Parameters
      7.2.1 Mapping of Payload Type Parameters to SDP
      7.2.2 Usage with the SDP Offer/Answer Model
      7.2.3 Usage with SDP Offer/Answer Model
      7.2.4 Usage in Declarative Session Descriptions
      7.2.5 Signaling of Parallel Processing
   7.3 Examples
   7.4 Parameter Set Considerations
8. Security Considerations
9. Congestion Control
10. IANA Consideration
11. Informative Appendix: Application Examples
   11.1 Introduction
   11.2 Streaming
   11.3 Videoconferencing (Unicast to MANE, Unicast to Endpoints)
   11.4 Mobile TV (Multicast to MANE, Unicast to Endpoint)
12. Acknowledgements
13. References
   13.1 Normative References
   13.2 Informative References
14. Authors' Addresses

1. Introduction

1.1. The HEVC Codec

1.1.1 Overview

High Efficiency Video Coding [HEVC] is a forthcoming video coding
standard under development by the Joint Collaborative Team on Video
Coding (JCT-VC) formed by the ITU-T and ISO/IEC. It is reported to
provide significant coding efficiency gains over H.264 [H.264]. The
standard will be found under ISO/IEC as ISO/IEC 23008-2, informally as
MPEG-H Part 2. ITU-T may decide soon on the final recommendation
number.

H.264 and HEVC share a similar hybrid video codec design.
Conceptually, both technologies include a video coding layer (VCL) and
a network abstraction layer (NAL). The VCL of HEVC includes a
prediction stage that involves motion compensation and spatial
intra-prediction, integer transforms applied to prediction residuals,
and an entropy coding stage that uses arithmetic coding. As in H.264,
in-loop deblocking filtering is applied to the reconstructed picture.

An important difference of HEVC compared to H.264 is the coding
structure within a picture. In HEVC each picture is divided into
treeblocks of up to 64x64 luma samples.
Treeblocks can be recursively split into smaller Coding Units (CUs)
using a generic quad-tree segmentation structure. CUs can be further
split into Prediction Units (PUs) used for intra- and inter-prediction
and Transform Units (TUs) defined for transform and quantization. HEVC
includes integer transforms for a number of TU sizes.

HEVC also includes two new in-loop filters that may be applied after
the deblocking filtering: Sample Adaptive Offset (SAO) and Adaptive
Loop Filter (ALF).

For random access, HEVC introduces, besides Instantaneous Decoder
Refresh (IDR) pictures, the Clean Random Access (CRA) picture, which is
similar to what has conventionally been called an open
Group-of-Pictures (GOP) intra picture. In H.264, such a picture may be
signaled using a recovery point Supplemental Enhancement Information
(SEI) message, whereas HEVC uses a distinct NAL unit type to indicate a
CRA picture. Furthermore, HEVC specifies that a conforming bitstream
may start with a CRA picture, whereas in H.264 a conforming bitstream
must start with an IDR picture.

Temporal layer access (TLA) pictures were introduced in HEVC to
indicate temporal layer switching points.

Predictively coded pictures can include uni-predicted and bi-predicted
slices. The flexibility in creating picture coding structures is
roughly comparable to H.264. The VCL generates and consumes syntax
structures designed to be adaptable to MTU sizes commonly found in IP
networks, irrespective of the size of a coded picture. Picture
segmentation is achieved through slices. A concept of "fine
granularity slices" (FGS) is included that allows slice boundaries to
be created within a treeblock.

The Network Abstraction Layer (NAL) is responsible for information
required for the decoding process of more than one slice, which is
collected in parameter sets.
A number of data structures that are not strictly required for the
decoding process, but are potentially helpful in decoding systems, can
be conveyed in structures such as Supplemental Enhancement Information
(SEI) messages, access unit delimiters, and so on. All the
aforementioned MTU-sized (or smaller) data structures are available in
the form of Network Abstraction Layer units (NAL units).

The single distinguishing difference between HEVC and H.264 with
respect to the RTP payload format design is the availability of
VCL-based coding tools that are specifically designed to enable
processing on high-level parallel architectures. These tools are
described below in sufficient detail to provide motivation for the
parallel processing signaling support that is described in section
7.2.5.

1.1.2 Parallel Processing Support

The reportedly significantly higher computational demand of HEVC over
H.264, in conjunction with the ever increasing video resolution (both
spatially and temporally) required by the market, led to the adoption
of VCL coding tools specifically targeted to allow for parallelization
on the sub-picture level. That is, parallelization occurs, at the
minimum, at the granularity of an integer number of treeblocks. The
targets for this type of high-level parallelization are multicore CPUs
and DSPs as well as multiprocessor systems. In a system design, to be
useful, these tools require signaling support, which is provided in
section 7.2.5 of this memo. This section provides a brief overview of
the tools available in [HEVC]. This section is expected to be updated
frequently as the HEVC draft evolves.

For parallelization, four picture partition strategies are available.

Regular slices are segments of the bitstream that can be reconstructed
independently from other regular slices within the same picture
(though there may still be interdependencies through loop filtering
operations). Regular slices are the only tool that can be used for
parallelization that is also available, in virtually identical form,
in H.264. Parallelization based on regular slices does not require
much inter-processor or inter-core communication (except for
inter-processor or inter-core data sharing for motion compensation
when decoding a predictively coded picture, which is typically much
heavier than inter-processor or inter-core data sharing due to
in-picture prediction), as slices are designed to be independently
decodable. However, for the same reason, regular slices can require
some coding overhead. Further, regular slices (in contrast to some of
the other tools mentioned below) also serve as the key mechanism for
bitstream partitioning to match MTU size requirements, because regular
slices are independent within a picture and each regular slice is
encapsulated in its own NAL unit. In many cases, the goal of
parallelization and the goal of MTU size matching can place
contradicting demands on the slice layout in a picture. The
realization of this situation led to the development of the more
advanced tools mentioned below. This payload format does not contain
any specific mechanisms aiding parallelization through regular slices.

Entropy slices, like regular slices, break entropy decoding
dependencies but allow prediction (and filtering) to cross slice
boundaries. As such, they can be used as a lightweight mechanism to
parallelize entropy decoding without affecting other decoding steps.
They are lightweight because, although each entropy slice is
encapsulated in its own NAL unit, it has a much shorter slice header:
most of the slice header syntax elements are not present and must be
inherited from the preceding full slice header.
Due to the allowance of in-picture prediction between neighboring
entropy slices within a picture, the required
inter-processor/inter-core communication to enable in-picture
prediction can be substantial. For the same reason, entropy slices
cannot be used for MTU size matching. Entropy slices appear to be
useful only for system architectures that execute the entropy decoding
process on a multicore/multi-CPU architecture, but execute the
remaining decoding functionality on dedicated signal processing
hardware. At the time of writing, entropy slices are not included in
any profile defined in draft HEVC. No support of entropy slices is
included in this memo.

In Wavefront Parallel Processing, the picture is partitioned into rows
of treeblocks. Entropy decoding and prediction are allowed to use data
from treeblocks in other partitions. Parallel processing is possible
through parallel decoding of rows of treeblocks, where the start of
the decoding of a row is delayed by two treeblocks, so as to ensure
that data related to a treeblock above and to the right of the subject
treeblock is available before the subject treeblock is decoded. Using
this staggered start (which appears like a wavefront when represented
graphically), parallelization is possible with up to as many
processors/cores as the picture contains treeblock rows. At the time
of writing, the draft HEVC includes a mechanism to organize the coded
bits of different treeblock rows to be friendly to a particular number
of parallel processors/cores. For example, it is possible that coded
bits of even-numbered treeblock rows (treeblock rows 0, 2, 4, ...) all
come before coded bits of odd-numbered treeblock rows (treeblock rows
1, 3, 5, ...), such that the bitstream is friendly to two parallel
processors/cores, though decoding of an earlier-coming treeblock row
(e.g.
treeblock row 2) refers to a later-coming treeblock row (e.g.
treeblock row 1). As with entropy slices, due to the allowance of
in-picture prediction between neighboring treeblock rows within a
picture, the required inter-processor/inter-core communication to
enable in-picture prediction can be substantial. Wavefront parallel
processing partitioning does not result in more NAL units than when it
is not applied; thus, wavefront parallel processing cannot be used for
MTU size matching. At the time of writing, wavefront parallel
processing is not included in any profile of draft HEVC. This memo
does not specify support for it.

Tiles define horizontal and vertical boundaries that partition a
picture into tile columns and rows. The scan order of treeblocks is
changed to be local within a tile (in the order of a treeblock raster
scan of a tile), before decoding the top-left treeblock of the next
tile in the order of tile raster scan of a picture. Similar to regular
slices, tiles break in-picture prediction dependencies (including
entropy decoding dependencies). However, they do not need to be
included in individual NAL units (the same as wavefront parallel
processing in this regard); hence tiles cannot be used for MTU size
matching. Each tile can be processed by one processor/core, and the
inter-processor/inter-core communication required for in-picture
prediction between processing units decoding neighboring tiles is
limited to conveying the shared slice header in cases where a slice
spans more than one tile, and loop-filtering-related sharing of
reconstructed samples and metadata. As such, tiles are less demanding
in terms of memory bandwidth than wavefront parallel processing (WPP),
due to the in-picture independence between two neighboring partitions.
Tiles are included in the (single) existing profile of [HEVC], and
their support in the context of this memo will be specified in section
7 of this memo.

The interaction between regular slices and tiles is simplified by
constraints of the HEVC draft. Specifically, for each slice and tile,
either or both of the following conditions must be fulfilled: 1) all
coded blocks in a slice belong to the same tile; 2) all coded blocks
in a tile belong to the same slice.

1.1.3 Parameter Sets

The parameter set concept is borrowed from [H.264]. In addition to
Sequence Parameter Sets (SPS), carrying data valid for the whole video
sequence, and Picture Parameter Sets (PPS), carrying information valid
on a picture-by-picture basis, the new Adaptation Parameter Set (APS)
carries picture-adaptive information that is also valid on a
picture-by-picture basis but is expected to change (typically much)
more frequently than the information in the PPS.

1.1.4 NAL Unit Header

HEVC maintains the NAL unit concept of H.264 with modifications. HEVC
uses a two-byte NAL unit header. Table 1 lists the allocation of NAL
unit types for VCL NAL units and non-VCL NAL units.

Table 1. NAL unit types in HEVC

Type    NAL Unit Name                               NAL unit type class
----------------------------------------------------------------
0       Unspecified                                 non-VCL
1       Coded slice of a non-IDR, non-CRA           VCL
        and non-TLA picture
2       Reserved                                    -
3       Coded slice of a TLA picture                VCL
4       Coded slice of a CRA picture                VCL
5       Coded slice of an IDR picture               VCL
6       Supplemental enhancement information (SEI)  non-VCL
7       Sequence parameter set                      non-VCL
8       Picture parameter set                       non-VCL
9       Access unit delimiter                       non-VCL
10..11  Reserved                                    -
12      Filler data                                 non-VCL
13      Reserved                                    -
14      Adaptation parameter set                    non-VCL
15..23  Reserved                                    -
24..63  Unspecified                                 non-VCL

The syntax and semantics of the NAL unit header are specified in
[HEVC], but the essential properties of the NAL unit header are
summarized below for convenience.

The first byte of the NAL unit header has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|N|   Type    |
+---------------+

The semantics of the components of the NAL unit header octets, as
specified in [HEVC], are described briefly below. In addition to the
name and size of each field, the corresponding syntax element name in
[HEVC] is also provided.

F: 1 bit
   forbidden_zero_bit. HEVC declares a value of 1 as a syntax
   violation. Note: the bit is wasted for compatibility with MPEG-2
   transport systems.

N: 1 bit
   nal_ref_flag. A value of 0 indicates that the content of the NAL
   unit is not used to reconstruct reference pictures for future
   prediction. Such NAL units can be discarded without potentially
   damaging the integrity of the reference pictures. A value of 1
   indicates that the decoding of the NAL unit is required to maintain
   the integrity of reference pictures or that the NAL unit contains a
   parameter set.

Type: 6 bits
   nal_unit_type. This component specifies the NAL unit type as
   defined in Table 7-1 of [HEVC], and in Table 1 in this memo.
   For a reference of all currently defined NAL unit types and their
   semantics, please refer to Section 7.4.1 in [HEVC].

In NAL units specified by HEVC, the second octet in the NAL unit
header is shown below.

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
| TID |    R    |
+---------------+

TID: 3 bits
   temporal_id. This component indicates the temporal identifier of
   the NAL unit in the coded sequence. For IDR pictures or CRA
   pictures the value is 0. For TLA pictures the value of temporal_id
   must be greater than 0.

R: 5 bits
   reserved_one_5bits. Reserved bits for future extension (such as
   scalability and three-dimensional video extensions). R MUST be
   equal to "00001" (in binary form). Decoders must ignore (i.e.
   remove from the bitstream and discard) NAL units with values of
   reserved_one_5bits not equal to "00001".

This memo extends the semantics of F, N, and TID, as described in
Section 4.2.

1.2. Overview of the Payload Format

This payload format defines the following processes required for
transport of HEVC coded data over RTP [RFC3550]:

o  Usage of RTP header with this payload format

o  Packetization of HEVC coded NAL units into RTP packets

o  Transmission of HEVC NAL units of the same bitstream within a
   single RTP session or within multiple RTP sessions

o  Payload format parameters to be used within the Session
   Description Protocol (SDP) [RFC4566].

2. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119
[RFC2119].

This specification uses the notion of setting and clearing a bit when
bit fields are handled. Setting a bit is the same as assigning that
bit the value of 1 (On). Clearing a bit is the same as assigning that
bit the value of 0 (Off).

3. Definitions and Abbreviations

3.1 Definitions

This document uses the terms and definitions of [HEVC]. Section 3.1.1
lists relevant definitions copied from [HEVC] for convenience. Section
3.1.2 gives definitions specific to this memo.

3.1.1 Definitions from the HEVC Specification

access unit: A set of NAL units that are consecutive in decoding order
and contain exactly one coded picture. In addition to the coded slice
NAL units of the coded picture, the access unit may also contain other
NAL units not containing slices of the coded picture. The decoding of
an access unit always results in a decoded picture.

coded video sequence: A sequence of access units that consists, in
decoding order, of an IDR access unit followed by zero or more non-IDR
access units including all subsequent access units up to but not
including any subsequent IDR access unit.

CRA access unit: An access unit in which the coded picture is a CRA
picture.

CRA picture: A coded picture containing only I slices and for which
each slice has nal_unit_type equal to 4; all coded pictures that
follow the Clean Random Access (CRA) picture both in decoding order
and output order shall not use inter prediction from any picture that
precedes the CRA picture either in decoding order or output order; and
any picture that precedes the CRA picture in decoding order also
precedes the CRA picture in output order.

IDR access unit: An access unit in which the coded picture is an IDR
picture.

IDR picture: A coded picture for which the variable IdrPicFlag is
equal to 1. An IDR picture causes the decoding process to mark all
reference pictures as "unused for reference". All coded pictures that
follow an IDR picture in decoding order can be decoded without inter
prediction from any picture that precedes the IDR picture in decoding
order.
The first picture of each coded video sequence in decoding order is an
IDR picture.

Random Access: The act of starting the decoding process for a
bitstream at a point other than the beginning of the stream.

Tile: An integer number of treeblocks co-occurring in one column and
one row (each of which comprising one or more columns or rows of
treeblocks), ordered consecutively in treeblock raster scan of the
tile. The division of each picture into tiles is a partitioning. Tiles
in a picture are ordered consecutively in tile raster scan of the
picture. Although a slice contains treeblocks that are consecutive in
treeblock raster scan of a tile, these treeblocks are not necessarily
consecutive in treeblock raster scan of the picture.

3.1.2 Definitions Specific to This Memo

media aware network element (MANE): A network element, such as a
middlebox or application layer gateway, that is capable of parsing
certain aspects of the RTP payload headers or the RTP payload and
reacting to their contents.

   Informative note: The concept of a MANE goes beyond normal routers
   or gateways in that a MANE has to be aware of the signaling (e.g.,
   to learn about the payload type mappings of the media streams), and
   in that it has to be trusted when working with SRTP. The advantage
   of using MANEs is that they allow packets to be dropped according
   to the needs of the media coding. For example, if a MANE has to
   drop packets due to congestion on a certain link, it can identify
   and remove those packets whose elimination produces the least
   adverse effect on the user experience. After dropping packets,
   MANEs must rewrite RTCP packets to match the changes to the RTP
   packet stream as specified in Section 7 of [RFC3550].

NAL unit decoding order: A NAL unit order that conforms to the
constraints on NAL unit order given in Section 7.4.1.2.3 in [HEVC].

NALU-time: The value that the RTP timestamp would have if the NAL unit
were transported in its own RTP packet.

RTP packet stream: A sequence of RTP packets with increasing sequence
numbers (except for wrap-around), identical PT and identical SSRC
(Synchronization Source), carried in one RTP session. Within the scope
of this memo, one RTP packet stream is utilized to transport one or
more layers.

transmission order: The order of packets in ascending RTP sequence
number order (in modulo arithmetic). Within an aggregation packet, the
NAL unit transmission order is the same as the order of appearance of
NAL units in the packet.

3.2 Abbreviations

TBD

4. RTP Payload Format

4.1 RTP Header Usage

The format of the RTP header is specified in [RFC3550] and reprinted
in Figure 1 for convenience. This payload format uses the fields of
the header in a manner consistent with that specification.

The RTP payload (and the settings for some RTP header bits) for
aggregation packets and fragmentation units are specified in Sections
4.7 and 4.8, respectively.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1 RTP header according to [RFC3550]

The RTP header information to be set according to this RTP payload
format is set as follows:

Marker bit (M): 1 bit
   Set for the very last packet of the access unit indicated by the
   RTP timestamp, in line with the normal use of the M bit in video
   formats, to allow an efficient playout buffer handling. For
   aggregation packets (STAP), the marker bit in the RTP header MUST
   be set to the value that the marker bit of the last NAL unit of the
   aggregation packet would have been if it were transported in its
   own RTP packet. Decoders MAY use this bit as an early indication of
   the last packet of an access unit but MUST NOT rely on this
   property.

      Informative note: Only one M bit is associated with an
      aggregation packet carrying multiple NAL units. Thus, if a
      gateway has re-packetized an aggregation packet into several
      packets, it cannot reliably set the M bit of those packets.

Payload type (PT): 7 bits
   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document and will not be specified here.
   The assignment of a payload type has to be performed either through
   the profile used or in a dynamic way.

Sequence number (SN): 16 bits
   Set and used in accordance with RFC 3550. In some packetization
   modes (list TBD), the sequence number is used to determine decoding
   order for the NALUs.

Timestamp: 32 bits
   The RTP timestamp is set to the sampling timestamp of the content.
   A 90 kHz clock rate MUST be used.

   If the NAL unit has no timing properties of its own (e.g.,
   parameter set and SEI NAL units), the RTP timestamp is set to the
   RTP timestamp of the coded picture of the access unit in which the
   NAL unit is included, according to Section 7.4.1.2.3 of [HEVC].

   Receivers SHOULD ignore any picture timing SEI messages included in
   access units that have only one display timestamp. Instead,
   receivers SHOULD use the RTP timestamp for synchronizing the
   display process.

   If one access unit has more than one display timestamp carried in a
   picture timing SEI message, then the information in the SEI message
   SHOULD be treated as relative to the RTP timestamp, with the
   earliest event occurring at the time given by the RTP timestamp and
   subsequent events later, as given by the difference in picture time
   values carried in the picture timing SEI message. Let tSEI1, tSEI2,
   ..., tSEIn be the display timestamps carried in the SEI message of
   an access unit, where tSEI1 is the earliest of all such timestamps.
   Let tmadjst() be a function that adjusts the time scale of the SEI
   message to a 90-kHz time scale. Let TS be the RTP timestamp. Then,
   the display time for the event associated with tSEI1 is TS. The
   display time for the event with tSEIx, where x is [2..n], is
   TS + tmadjst(tSEIx - tSEI1).

4.2 NAL Unit Header Usage

The structure and semantics of the NAL unit header according to the
HEVC specification [HEVC] were introduced in Section 1.1.4. This
section specifies the extended semantics of the NAL unit header
fields.

4.3 Payload Structures

The NAL unit structure is central to HEVC [HEVC]; all HEVC coded bits
for representing a video signal are encapsulated in NAL units.
Therefore each RTP packet payload is structured as a NAL unit, which
contains one or a part of one NAL unit specified in HEVC, or
aggregates one or more NAL units specified in HEVC.

4.4 Transmission Modes

This memo enables transmission of an HEVC bitstream over a single RTP
session or multiple RTP sessions.

TBD: SSRC Muxing for video conf. + TV broadcast/multicast.
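
As an illustration of the display-time rule given under "Timestamp" in
Section 4.1, the following Python fragment is a minimal sketch, not a
normative procedure. The function name display_times and the parameter
sei_clock_hz are hypothetical. Two further assumptions are not stated
in this memo: tmadjst() is modeled as a simple integer rescaling to
90 kHz with truncation, and results are reduced modulo 2^32 because
the RTP timestamp field is 32 bits wide.

```python
RTP_CLOCK_HZ = 90000  # clock rate required by Section 4.1


def display_times(ts, sei_times, sei_clock_hz):
    """Return display times on the 90 kHz RTP time scale.

    ts           -- RTP timestamp TS of the access unit
    sei_times    -- display timestamps tSEI1..tSEIn from the SEI message
    sei_clock_hz -- time scale of the SEI timestamps (assumed parameter)
    """
    earliest = min(sei_times)  # tSEI1 is the earliest timestamp

    def tmadjst(delta):
        # Assumed rescaling: integer conversion to the 90 kHz time scale.
        return delta * RTP_CLOCK_HZ // sei_clock_hz

    # The event for tSEI1 is displayed at TS; every other event at
    # TS + tmadjst(tSEIx - tSEI1). Wrap to 32 bits (assumption).
    return [(ts + tmadjst(t - earliest)) % 2**32 for t in sorted(sei_times)]


print(display_times(1000, [0, 1, 2], 60))
```

With an SEI time scale of 60 Hz, one SEI tick corresponds to 1500
ticks of the 90 kHz clock, so the three events are spaced 1500 RTP
ticks apart.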

4.5 Packetization Modes

This memo specifies the following packetization modes:

o  Non-interleaved mode

o  Interleaved mode

In the non-interleaved mode, NAL units are transmitted in NAL unit
decoding order. The interleaved mode allows transmission of NAL units
out of NAL unit decoding order.

The packetization mode in use MAY be signaled by the value of the
OPTIONAL packetization-mode media type parameter. The packetization
mode in use governs which NAL unit types are allowed in RTP payloads.
Table 2 summarizes the allowed packet payload types for each
packetization mode. Packetization modes are explained in more detail
in Section 5.

Table 2. Summary of allowed NAL unit types for each packetization mode
(yes = allowed, no = disallowed, ig = ignore)

Payload  Packet     Non-Interleaved  Interleaved
Type     Type       Mode             Mode
-------------------------------------------------
0        reserved   ig               ig
1-23     NAL unit   yes              no
24       STAP-A     yes              no
25       STAP-B     no               yes
26       FU-A       yes              yes
27       FU-B       no               yes
28-63    reserved   ig               ig

Some NAL unit or payload type values (indicated as reserved in Table
2) are reserved for future extensions. NAL units of those types SHOULD
NOT be sent by a sender (directly as packet payloads, or as
aggregation units in aggregation packets, or as fragmented units in FU
packets) and MUST be ignored by a receiver. For example, the payload
types 1-23, with the associated packet type "NAL unit", are allowed in
"Non-Interleaved Mode", but disallowed in "Interleaved Mode". However,
NAL units of NAL unit types 1-23 can be used in "Interleaved Mode" as
aggregation units in STAP-B packets as well as fragmented units in
FU-A and FU-B packets. Similarly, NAL units of NAL unit types 1-23 can
also be used in the "Non-Interleaved Mode" as aggregation units in
STAP-A packets or fragmented units in FU-A packets, in addition to
being directly used as packet payloads.
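
The lookup implied by Table 2 can be sketched in Python as follows.
This is only an illustration under stated assumptions: the names
TABLE_2, payload_type_of, and classify are hypothetical, and the
payload type is assumed to be read from the low six bits of the first
payload octet, following the F|N|Type layout of Section 1.1.4 and the
statement in Section 4.3 that each RTP packet payload is structured as
a NAL unit.

```python
NON_INTERLEAVED = "non-interleaved"
INTERLEAVED = "interleaved"

# Rows of Table 2: (first type, last type, packet type,
#                   non-interleaved mode, interleaved mode)
TABLE_2 = [
    (0, 0, "reserved", "ig", "ig"),
    (1, 23, "NAL unit", "yes", "no"),
    (24, 24, "STAP-A", "yes", "no"),
    (25, 25, "STAP-B", "no", "yes"),
    (26, 26, "FU-A", "yes", "yes"),
    (27, 27, "FU-B", "no", "yes"),
    (28, 63, "reserved", "ig", "ig"),
]


def payload_type_of(payload: bytes) -> int:
    """Return the 6-bit Type field of the first payload octet."""
    if not payload:
        raise ValueError("empty RTP payload")
    # Bits 0 and 1 (most significant) are F and N; the rest is Type.
    return payload[0] & 0x3F


def classify(payload: bytes, mode: str):
    """Return (packet type, 'yes' | 'no' | 'ig') according to Table 2."""
    if mode not in (NON_INTERLEAVED, INTERLEAVED):
        raise ValueError("unknown packetization mode")
    ptype = payload_type_of(payload)
    for first, last, name, non_int, inter in TABLE_2:
        if first <= ptype <= last:
            return name, (non_int if mode == NON_INTERLEAVED else inter)
    raise AssertionError("unreachable: Table 2 covers types 0..63")


# First octet 0x58: F=0, N=1, Type=24 (STAP-A).
print(classify(bytes([0x58]), NON_INTERLEAVED))
```

A receiver built along these lines would drop payloads classified as
"ig" and treat "no" as a violation of the packetization mode in use.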
4.6 Decoding Order

In the interleaved packetization mode, the transmission order of NAL units is allowed to differ from the decoding order of the NAL units. Decoding order number (DON) is a field in the payload structure or a derived variable that indicates the NAL unit decoding order. Rationale and examples of use cases for transmission out of decoding order and for the use of DON are given in section 13.

The coupling of transmission and decoding order is controlled by the OPTIONAL sprop-interleaving-depth media type parameter as follows. When the value of the OPTIONAL sprop-interleaving-depth media type parameter is equal to 0 (explicitly or per default), the transmission order of NAL units MUST conform to the NAL unit decoding order. When the value of the OPTIONAL sprop-interleaving-depth media type parameter is greater than 0,

o  the order of NAL units generated by de-packetizing STAP-Bs and FUs in two consecutive packets is NOT REQUIRED to be the NAL unit decoding order.

The RTP payload structures for an STAP-A and an FU-A do not include DON. STAP-B and FU-B structures include DON.

Informative note: When an FU-A occurs in interleaved mode, it always follows an FU-B, which sets its DON.

Informative note: If a transmitter wants to encapsulate a single NAL unit per packet and transmit packets out of their decoding order, the STAP-B packet type can be used.

In the non-interleaved packetization mode, the transmission order of NAL units in single NAL unit packets, STAP-As, and FU-As MUST be the same as their NAL unit decoding order. The NAL units within an STAP MUST appear in the NAL unit decoding order. Thus, the decoding order is first provided through the implicit order within an STAP, and second provided through the RTP sequence number for the order between STAPs, FUs, and single NAL unit packets.
Signaling of the value of DON for NAL units carried in an STAP-B and in a series of fragmentation units starting with an FU-B is specified in sections 4.7.1 and 4.8, respectively. The DON value of the first NAL unit in transmission order MAY be set to any value. Values of DON are in the range of 0 to 65535, inclusive. After reaching the maximum value, the value of DON wraps around to 0.

The decoding order of two NAL units contained in any STAP-B, or in a series of fragmentation units starting with an FU-B, is determined as follows. Let DON(i) be the decoding order number of the NAL unit having index i in the transmission order. Function don_diff(m,n) is specified as follows:

   If DON(m) == DON(n), don_diff(m,n) = 0

   If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),
   don_diff(m,n) = DON(n) - DON(m)

   If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),
   don_diff(m,n) = 65536 - DON(m) + DON(n)

   If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),
   don_diff(m,n) = - (DON(m) + 65536 - DON(n))

   If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),
   don_diff(m,n) = - (DON(m) - DON(n))

A positive value of don_diff(m,n) indicates that the NAL unit having transmission order index n follows, in decoding order, the NAL unit having transmission order index m. When don_diff(m,n) is equal to 0, the two NAL units can be decoded in either order. A negative value of don_diff(m,n) indicates that the NAL unit having transmission order index n precedes, in decoding order, the NAL unit having transmission order index m.

Values of the DON field MUST be such that the decoding order determined by the values of DON, as specified above, conforms to the NAL unit decoding order. If the order of two NAL units in NAL unit decoding order is switched and the new order does not conform to the NAL unit decoding order, the NAL units MUST NOT have the same value of DON.
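
The five cases of don_diff can be transcribed almost literally. The Python sketch below is an illustration, not a normative definition; as a simplifying assumption it takes the two DON values directly as arguments (don_m for DON(m), don_n for DON(n)) rather than transmission order indices.

```python
DON_MODULUS = 65536   # DON is a 16-bit value that wraps around to 0
HALF_RANGE = 32768    # threshold separating "ahead" from "wrapped"


def don_diff(don_m: int, don_n: int) -> int:
    """Signed decoding-order distance from NAL unit m to NAL unit n.

    Positive: n follows m in decoding order.
    Negative: n precedes m in decoding order.
    Zero:     the two NAL units may be decoded in either order.
    """
    for don in (don_m, don_n):
        if not 0 <= don < DON_MODULUS:
            raise ValueError("DON must be in the range 0 to 65535")
    if don_m == don_n:
        return 0
    if don_m < don_n and don_n - don_m < HALF_RANGE:
        return don_n - don_m
    if don_m > don_n and don_m - don_n >= HALF_RANGE:
        return DON_MODULUS - don_m + don_n
    if don_m < don_n and don_n - don_m >= HALF_RANGE:
        return -(don_m + DON_MODULUS - don_n)
    # Remaining case: don_m > don_n and don_m - don_n < HALF_RANGE.
    return -(don_m - don_n)
```

For instance, don_diff(65535, 0) evaluates to 1 under these rules, because a DON of 0 directly follows 65535 after wrap-around, while don_diff(0, 65535) evaluates to -1.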
If the order of two consecutive NAL units in the NAL unit stream is switched and the new order still conforms to the NAL unit decoding order, the NAL units MAY have the same value of DON. Consequently, NAL units having the same value of DON can be decoded in any order, and two NAL units having a different value of DON should be passed to the decoder in the order specified above. When two consecutive NAL units in the NAL unit decoding order have a different value of DON, the value of DON for the second NAL unit in decoding order SHOULD be the value of DON for the first, incremented by one.

An example of the de-packetization process to recover the NAL unit decoding order is given in section 7.

Informative note: Receivers should not expect that the absolute difference of values of DON for two consecutive NAL units in the NAL unit decoding order will be equal to one, even in error-free transmission. An increment by one is not required, as at the time of associating values of DON to NAL units, it may not be known whether all NAL units are delivered to the receiver. For example, a gateway may not forward coded slice NAL units of non-reference pictures or SEI NAL units when there is a shortage of bit rate in the network to which the packets are forwarded. In another example, a live broadcast is interrupted by pre-encoded content, such as commercials, from time to time. The first intra picture of a pre-encoded clip is transmitted in advance to ensure that it is readily available in the receiver. When transmitting the first intra picture, the originator does not exactly know how many NAL units will be encoded before the first intra picture of the pre-encoded clip follows in decoding order. Thus, the values of DON for the NAL units of the first intra picture of the pre-encoded clip have to be estimated when they are transmitted, and gaps in values of DON may occur.
4.7 Aggregation Packets

Aggregation packets are the NAL unit aggregation scheme of this payload specification. The scheme is introduced to reflect the dramatically different MTU sizes of two key target networks: wireline IP networks (with an MTU size that is often limited by the Ethernet MTU size; roughly 1500 bytes), and IP or non-IP (e.g., ITU-T H.324/M) based wireless communication systems with preferred transmission unit sizes of 254 bytes or less. To prevent media transcoding between the two worlds, and to avoid undesirable packetization overhead, a NAL unit aggregation scheme is introduced.

One type of aggregation packet, the single-time aggregation packet (STAP), is defined by this specification:

o  Single-time aggregation packet (STAP): aggregates NAL units with identical NALU-time. Two types of STAPs are defined, one without DON (STAP-A) and another including DON (STAP-B).

Each NAL unit to be carried in an aggregation packet is encapsulated in an aggregation unit.

The structure of the RTP payload format for aggregation packets is presented in Figure 2.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|             one or more aggregation units                     |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2 RTP payload format for aggregation packets

STAPs have the following packetization rules:

The type field of the NAL unit type octet MUST be set to the appropriate value for STAP, as indicated in Table 2. The F bit MUST be cleared if all F bits of the aggregated NAL units are zero; otherwise, it MUST be set. The value of NRI MUST be the maximum NRI value of all the NAL units carried in the aggregation packet.
The marker bit in the RTP header is set to the value that the marker bit of the last NAL unit of the aggregated packet would have if it were transported in its own RTP packet.

The payload of an aggregation packet consists of one or more aggregation units. See section 4.7.1 for the single-time aggregation unit. An aggregation packet can carry as many aggregation units as necessary; however, the total amount of data in an aggregation packet MUST fit into an IP packet, and the size SHOULD be chosen so that the resulting IP packet is smaller than the MTU size. An aggregation packet MUST NOT contain fragmentation units specified in section 4.8. Aggregation packets MUST NOT be nested; i.e., an aggregation packet MUST NOT contain another aggregation packet.

4.7.1 Single-Time Aggregation Packet (STAP)

A single-time aggregation packet (STAP) SHOULD be used whenever NAL units are aggregated that all share the same NALU-time. The payload of an STAP-A consists of at least one single-time aggregation unit, as presented in Figure 3. The payload of an STAP-B consists of a 16-bit unsigned decoding order number (DON) (in network byte order) followed by at least one single-time aggregation unit, as presented in Figure 4.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                :                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|                single-time aggregation units                  |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3 Payload format for STAP-A

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                :  decoding order number (DON)  |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                single-time aggregation units                  |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 4 Payload format for STAP-B

The DON field specifies the value of DON for the first NAL unit in an STAP-B in transmission order. For each successive NAL unit in appearance order in an STAP-B, the value of DON is equal to (the value of DON of the previous NAL unit in the STAP-B + 1) % 65536, in which '%' stands for the modulo operation.

A single-time aggregation unit consists of 16-bit unsigned size information (in network byte order) that indicates the size of the following NAL unit in bytes (excluding these two octets, but including the NAL unit type octet of the NAL unit), followed by the NAL unit itself, including its NAL unit type byte. A single-time aggregation unit is byte aligned within the RTP payload, but it may not be aligned on a 32-bit word boundary. Figure 5 presents the structure of the single-time aggregation unit.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                :        NAL unit size          |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                           NAL unit                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 5 Structure for single-time aggregation unit (STAU)

Figure 6 presents an example of an RTP packet that contains an STAP-A. The STAP-A contains two single-time aggregation units, labeled as 1 and 2 in the figure.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NALU 1 HDR    |                 NALU 1 Data                   |
+-+-+-+-+-+-+-+-+                                               |
:                                                               :
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NALU 2 HDR    |                 NALU 2 Data                   |
+-+-+-+-+-+-+-+-+                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 6 An example of an RTP packet including an STAP-A containing two single-time aggregation units

Figure 7 presents an example of an RTP packet that contains an STAP-B. The STAP-B contains two single-time aggregation units, labeled as 1 and 2 in the figure.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|STAP-B NAL HDR | DON                           | NALU 1 Size   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NALU 1 Size   | NALU 1 HDR                    | NALU 1 Data   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NALU 2 HDR    |                 NALU 2 Data                   |
+-+-+-+-+-+-+-+-+                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7 An example of an RTP packet including an STAP-B containing two single-time aggregation units

4.8 Fragmentation Units (FUs)

This payload type allows fragmenting a NAL unit into several RTP packets. Doing so on the application layer instead of relying on lower layer fragmentation (e.g., by IP) may have the following use cases:

o  The payload format is capable of transporting NAL units bigger than 64 kbytes over an IPv4 network. Such NAL units may be present in pre-recorded video, particularly in High Definition formats (there is a limit on the number of slices per picture, which results in a limit on the number of NAL units per picture, which may result in big NAL units).

o  The fragmentation mechanism allows fragmenting a single NAL unit and applying generic forward error correction.

Fragmentation is defined only for a single NAL unit and not for any aggregation packets. A fragment of a NAL unit consists of an integer number of consecutive octets of that NAL unit. Each octet of the NAL unit MUST be part of exactly one fragment of that NAL unit.
Fragments of the same NAL unit MUST be sent in consecutive order with ascending RTP sequence numbers (with no other RTP packets within the same RTP packet stream being sent between the first and last fragment). Similarly, a NAL unit MUST be reassembled in RTP sequence number order.

When a NAL unit is fragmented and conveyed within fragmentation units (FUs), it is referred to as a fragmented NAL unit. STAPs MUST NOT be fragmented. FUs MUST NOT be nested; i.e., an FU MUST NOT contain another FU.

The RTP timestamp of an RTP packet carrying an FU is set to the NALU-time of the fragmented NAL unit.

Figure 8 presents the RTP payload format for FU-A. An FU-A consists of a fragmentation unit indicator of one octet, a fragmentation unit header of one octet, and a fragmentation unit payload.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU NAL HDR    |   FU header   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 8 RTP payload format for FU-A

Figure 9 presents the RTP payload format for FU-Bs. An FU-B consists of a fragmentation unit indicator of one octet, a fragmentation unit header of one octet, a decoding order number (DON) (in network byte order), and a fragmentation unit payload. In other words, the structure of FU-B is the same as the structure of FU-A, except for the additional DON field.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU indicator  |   FU header   |              DON              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 9 RTP payload format for FU-B

NAL unit type FU-B MUST be used in the interleaved packetization mode for the first fragmentation unit of a fragmented NAL unit. NAL unit type FU-B MUST NOT be used in any other case. In other words, in the interleaved packetization mode, each NALU that is fragmented has an FU-B as the first fragment, followed by one or more FU-A fragments.

The FU NAL HDR octet has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|N|   Type    |
+---------------+

A value equal to 26 in the Type field of the FU indicator octet identifies an FU-A packet and a value of 27 identifies an FU-B packet. The use of the F bit is described in section 5. The value of the N field MUST be set according to the value of the N field in the fragmented NAL unit.

The FU header has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|   Type    |
+---------------+

S: 1 bit
   When set to one, the Start bit indicates the start of a fragmented NAL unit. When the following FU payload is not the start of a fragmented NAL unit payload, the Start bit is set to zero.

E: 1 bit
   When set to one, the End bit indicates the end of a fragmented NAL unit, i.e., the last byte of the payload is also the last byte of the fragmented NAL unit.
   When the following FU payload is not the last fragment of a fragmented NAL unit, the End bit is set to zero.

Type: 6 bits
   The NAL unit payload type as defined in Table 7-1 of [HEVC].

The value of DON in FU-Bs is selected as described in section 4.6.

Informative note: The DON field in FU-Bs allows gateways to fragment NAL units to FU-Bs without organizing the incoming NAL units to the NAL unit decoding order.

A fragmented NAL unit MUST NOT be transmitted in one FU; i.e., the Start bit and End bit MUST NOT both be set to one in the same FU header.

The FU payload consists of fragments of the payload of the fragmented NAL unit so that if the fragmentation unit payloads of consecutive FUs are sequentially concatenated, the payload of the fragmented NAL unit can be reconstructed. The NAL unit type octet of the fragmented NAL unit is not included as such in the fragmentation unit payload, but rather the information of the NAL unit type octet of the fragmented NAL unit is conveyed in F and N fields of the FU indicator octet of the fragmentation unit and in the type field of the FU header. An FU payload MAY have any number of octets and MAY be empty.

If a fragmentation unit is lost, the receiver SHOULD discard all following fragmentation units in transmission order corresponding to the same fragmented NAL unit.

A receiver in an endpoint or in a MANE MAY aggregate the first n-1 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment n of that NAL unit is not received. In this case, the forbidden_zero_bit of the NAL unit MUST be set to one to indicate a syntax violation.

5. Packetization Rules

The packetization modes are introduced in section 4.5. The packetization rules common to more than one of the packetization modes are specified in section 5.1.
The packetization rules for the non-interleaved mode are specified in section 5.2, and the packetization rules for the interleaved mode are specified in section 5.3.

5.1 Common Packetization Rules

All senders MUST enforce the following packetization rules regardless of the packetization mode in use:

o  VCL NAL units belonging to the same coded picture (and thus sharing the same RTP timestamp value) SHOULD be sent in their original decoding order to minimize the delay. Note that the decoding order is the order of the NAL units in the bitstream.

o  Parameter sets are handled in accordance with the rules and recommendations given in section 7.4.

o  MANEs MUST NOT duplicate any NAL unit except for sequence or picture parameter set NAL units, as neither this memo nor the HEVC specification provides means to identify duplicated NAL units. Sequence and picture parameter set NAL units MAY be duplicated to make their correct reception more probable, but any such duplication MUST NOT affect the contents of any active sequence or picture parameter set. Duplication SHOULD be performed on the application layer and not by duplicating RTP packets (with identical sequence numbers).

Senders using the non-interleaved mode or the interleaved mode MUST enforce the following packetization rule:

o  MANEs MAY convert single NAL unit packets into one aggregation packet, convert an aggregation packet into several single NAL unit packets, or mix both concepts, in an RTP translator. The RTP translator SHOULD take into account at least the following parameters: path MTU size, unequal protection mechanisms (e.g., through packet-based FEC according to [RFC5109], especially for sequence and picture parameter set NAL units and coded slice data partition A NAL units), bearable latency of the system, and buffering capabilities of the receiver.
   Informative note: An RTP translator is required to handle RTCP as per [RFC3550].

5.2 Non-Interleaved mode

This mode MUST be supported. This mode is in use when the value of the OPTIONAL packetization-mode media type parameter is equal to 1. It is primarily intended for low-delay applications. Only single NAL unit packets, STAP-As, and FU-As MAY be used in this mode. The transmission order of NAL units MUST comply with the NAL unit decoding order.

5.3 Interleaved mode

This mode is in use when the value of the OPTIONAL packetization-mode media type parameter is equal to 2. Some receivers MAY support this mode. STAP-Bs, FU-As, and FU-Bs MAY be used. STAP-As and single NAL unit packets MUST NOT be used. The transmission order of packets and NAL units is constrained as specified in section 4.6.

6. De-Packetization Process

The de-packetization process is implementation dependent. Therefore, the following description should be seen as an example of a suitable implementation. Other schemes may be used as well, as long as the output for the same input is the same as that of the process described below. Here, the same output means that the number of NAL units and their order are both identical. Optimizations relative to the described algorithms are likely possible. Section 6.1 presents the de-packetization process for the non-interleaved packetization mode and section 6.2 presents the de-packetization process for the interleaved packetization mode.

All normal RTP mechanisms related to buffer management apply. In particular, duplicated or outdated RTP packets (as indicated by the RTP sequence number and the RTP timestamp) are removed. To determine the exact time for decoding, factors such as a possible intentional delay to allow for proper inter-stream synchronization must be factored in.
6.1 Non-Interleaved Mode

The receiver includes a receiver buffer to compensate for transmission delay jitter. The receiver stores incoming packets in reception order into the receiver buffer. Packets are de-packetized in RTP sequence number order. If a de-packetized packet is a single NAL unit packet, the NAL unit contained in the packet is passed directly to the decoder. If a de-packetized packet is an STAP-A, the NAL units contained in the packet are passed to the decoder in the order in which they are encapsulated in the packet. For all the FU-A packets containing fragments of a single NAL unit, the de-packetized fragments are concatenated in their sending order to recover the NAL unit, which is then passed to the decoder.

6.2 Interleaved Mode

The general concept behind these de-packetization rules is to reorder NAL units from transmission order to the NAL unit decoding order.

The receiver includes a receiver buffer, which is used to compensate for transmission delay jitter and to reorder NAL units from transmission order to the NAL unit decoding order. In this section, the receiver operation is described under the assumption that there is no transmission delay jitter. To distinguish it from a practical receiver buffer that is also used to compensate for transmission delay jitter, the receiver buffer is hereafter called the de-interleaving buffer in this section. Receivers SHOULD also prepare for transmission delay jitter; i.e., either reserve separate buffers for transmission delay jitter buffering and de-interleaving buffering or use a receiver buffer for both transmission delay jitter and de-interleaving. Moreover, receivers SHOULD take transmission delay jitter into account in the buffering operation; e.g., by additional initial buffering before starting of decoding and playback.
This section is organized as follows: subsection 6.2.1 presents how to calculate the size of the de-interleaving buffer. Subsection 6.2.2 specifies the receiver process for organizing received NAL units into the NAL unit decoding order.

6.2.1 Size of the De-interleaving Buffer

When the SDP Offer/Answer model or any other capability exchange procedure is used in session setup, the properties of the received stream SHOULD be such that the receiver capabilities are not exceeded. In the SDP Offer/Answer model, the receiver can indicate its capabilities to allocate a de-interleaving buffer with the deint-buf-cap media type parameter. The sender indicates the requirement for the de-interleaving buffer size with the sprop-deint-buf-req media type parameter. It is therefore RECOMMENDED to set the de-interleaving buffer size, in terms of number of bytes, equal to or greater than the value of the sprop-deint-buf-req media type parameter. See section 7.1 for further information on the deint-buf-cap and sprop-deint-buf-req media type parameters and section 8.2.2 for further information on their use in the SDP Offer/Answer model.

When a declarative session description is used in session setup, the sprop-deint-buf-req media type parameter signals the requirement for the de-interleaving buffer size. It is therefore RECOMMENDED to set the de-interleaving buffer size, in terms of number of bytes, equal to or greater than the value of the sprop-deint-buf-req media type parameter.

6.2.2 De-interleaving Process

There are two buffering states in the receiver: initial buffering and buffering while playing. Initial buffering occurs when the RTP session is initialized. After initial buffering, decoding and playback are started, and the buffering-while-playing mode is used.
Regardless of the buffering state, the receiver stores incoming NAL units, in reception order, in the de-interleaving buffer as follows. NAL units of aggregation packets are stored in the de-interleaving buffer individually. The value of DON is calculated and stored for each NAL unit.

The receiver operation is described below with the help of the following functions and constants:

o  Function AbsDON is specified in section 7.1.

o  Function don_diff is specified in section 4.6.

o  Constant N is the value of the OPTIONAL sprop-interleaving-depth media type parameter (see section 7.1) incremented by 1.

Initial buffering lasts until one of the following conditions is fulfilled:

o  There are N or more VCL NAL units in the de-interleaving buffer.

o  If sprop-max-don-diff is present, don_diff(m,n) is greater than the value of sprop-max-don-diff, in which n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units and m corresponds to the NAL unit having the smallest value of AbsDON among the received NAL units.

o  Initial buffering has lasted for the duration equal to or greater than the value of the OPTIONAL sprop-init-buf-time media type parameter.

The NAL units to be removed from the de-interleaving buffer are determined as follows:

o  If the de-interleaving buffer contains at least N VCL NAL units, NAL units are removed from the de-interleaving buffer and passed to the decoder in the order specified below until the buffer contains N-1 VCL NAL units.

o  If sprop-max-don-diff is present, all NAL units m for which don_diff(m,n) is greater than sprop-max-don-diff are removed from the de-interleaving buffer and passed to the decoder in the order specified below. Herein, n corresponds to the NAL unit having the greatest value of AbsDON among the NAL units in the de-interleaving buffer.
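
The three conditions that end initial buffering, listed earlier in this section, can be sketched in Python as follows. This is a rough illustration under several assumptions that are not part of this memo: the BufferedNal record and the function names are hypothetical, AbsDON is taken as a precomputed field rather than derived here, and the elapsed time and sprop-init-buf-time are assumed to be expressed in the same unit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BufferedNal:
    """Hypothetical record for one NAL unit in the de-interleaving buffer."""
    don: int       # 16-bit decoding order number
    abs_don: int   # AbsDON, assumed to be computed elsewhere
    is_vcl: bool   # True for VCL NAL units


def don_diff(don_m: int, don_n: int) -> int:
    """don_diff of section 4.6, taking DON values directly (assumption)."""
    if don_m == don_n:
        return 0
    if don_m < don_n:
        gap = don_n - don_m
        return gap if gap < 32768 else -(don_m + 65536 - don_n)
    gap = don_m - don_n
    return 65536 - don_m + don_n if gap >= 32768 else -gap


def initial_buffering_done(buffer: Sequence[BufferedNal],
                           interleaving_depth: int,
                           elapsed: int,
                           max_don_diff: Optional[int] = None,
                           init_buf_time: Optional[int] = None) -> bool:
    """Return True once any of the three listed conditions is fulfilled."""
    n_const = interleaving_depth + 1  # constant N
    # Condition 1: N or more VCL NAL units are buffered.
    if sum(1 for nal in buffer if nal.is_vcl) >= n_const:
        return True
    # Condition 2: only evaluated when sprop-max-don-diff is present.
    if max_don_diff is not None and buffer:
        m = min(buffer, key=lambda nal: nal.abs_don)
        n = max(buffer, key=lambda nal: nal.abs_don)
        if don_diff(m.don, n.don) > max_don_diff:
            return True
    # Condition 3: only evaluated when sprop-init-buf-time is present.
    if init_buf_time is not None and elapsed >= init_buf_time:
        return True
    return False
```

A receiver following this sketch would call initial_buffering_done() after storing each incoming NAL unit and start decoding and playback once it returns True.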
The order in which NAL units are passed to the decoder is specified as follows:

o  Let PDON be a variable that is initialized to 0 at the beginning of the RTP session.

o  For each NAL unit associated with a value of DON, a DON distance is calculated as follows. If the value of DON of the NAL unit is larger than the value of PDON, the DON distance is equal to DON - PDON. Otherwise, the DON distance is equal to 65535 - PDON + DON + 1.

o  NAL units are delivered to the decoder in ascending order of DON distance. If several NAL units share the same value of DON distance, they can be passed to the decoder in any order.

o  When a desired number of NAL units have been passed to the decoder, the value of PDON is set to the value of DON for the last NAL unit passed to the decoder.

6.3 Additional De-Packetization Guidelines

The following additional de-packetization rules may be used to implement an operational HEVC de-packetizer:

o  Intelligent RTP receivers (e.g., in gateways) may identify lost FUs. If a lost FU is found, a gateway may decide not to send the following FUs of the same fragmented NAL unit, as their information is meaningless for HEVC decoders. In this way a MANE can reduce network load by discarding useless packets without parsing a complex bitstream.

o  Intelligent receivers having to discard packets or NALUs should first discard all packets/NALUs in which the value of the NRI field of the NAL unit type octet is equal to 0. This will minimize the impact on user experience and keep the reference pictures intact. If more packets have to be discarded, then packets with an NRI value equal to zero may be discarded before packets with a higher NRI value. However, discarding any packets with an NRI not equal to zero very likely leads to decoder drift and SHOULD be avoided.

7. Payload Format Parameters

This section specifies the parameters that MAY be used to select optional features of the payload format and certain features of the bitstream. The parameters are specified here as part of the media type registration for the HEVC codec. A mapping of the parameters into the Session Description Protocol (SDP) [RFC4566] is also provided for applications that use SDP. Equivalent parameters could be defined elsewhere for use with control protocols that do not use SDP.

Some parameters provide a receiver with the properties of the stream that will be sent. The names of all these parameters start with "sprop" for stream properties. Some of these "sprop" parameters are limited by other payload or codec configuration parameters. For example, the sprop-parameter-sets parameter is constrained by the profile-level-id parameter. The media sender selects all "sprop" parameters rather than the receiver. This uncommon characteristic of the "sprop" parameters may be incompatible with some signaling protocol concepts, in which case the use of these parameters SHOULD be avoided.

7.1 Media Type Registration

The media subtype for the HEVC codec is allocated from the IETF tree.

The receiver MUST ignore any unspecified parameter.

Media Type name: video

Media subtype name: H265

Required parameters: none

OPTIONAL parameters:

In the following definitions of parameters, "the stream" or "the NAL unit stream" refers to all NAL units conveyed in the current RTP session in SST, and all NAL units conveyed in the current RTP session and all NAL units conveyed in other RTP sessions that the current RTP session depends on in MST.
profile-level-id: TBD

sprop-parameter-sets: TBD

max-mbps, max-smbps, max-fs, max-cpb, max-dpb, and max-br: TBD

max-mbps: TBD

max-smbps: TBD

max-fs: TBD

max-cpb: TBD

max-dpb: TBD

max-br: TBD

redundant-pic-cap: TBD

sprop-level-parameter-sets: TBD

use-level-src-parameter-sets: TBD

packetization-mode:
   This parameter signals the properties of an RTP payload type or the capabilities of a receiver implementation. Only a single configuration point can be indicated; thus, when capabilities to support more than one packetization-mode are declared, multiple configuration points (RTP payload types) must be used.

   When the value of packetization-mode is equal to 1, the non-interleaved mode, as defined in section 5.2, MUST be used. When the value of packetization-mode is equal to 2, the interleaved mode, as defined in section 5.3, MUST be used. The value of packetization-mode MUST be an integer in the range of 1 to 2, inclusive.

sprop-interleaving-depth:
   This parameter MUST NOT be present when packetization-mode is not present or the value of packetization-mode is not equal to 2. This parameter MUST be present when the value of packetization-mode is equal to 2.

   This parameter signals the properties of an RTP packet stream. It specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the RTP packet stream in transmission order and follow the VCL NAL unit in decoding order. Consequently, it is guaranteed that receivers can reconstruct NAL unit decoding order when the buffer size for NAL unit decoding order recovery is at least the value of sprop-interleaving-depth + 1 in terms of VCL NAL units.

   The value of sprop-interleaving-depth MUST be an integer in the range of 0 to 32767, inclusive.

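As an illustration of the sprop-interleaving-depth definition above, the following Python sketch computes the value for a sequence of VCL NAL units listed in transmission order. It is a minimal sketch, not part of the payload format: the function name and the representation of each NAL unit by its index in decoding order are assumptions made for this example only.

```python
def interleaving_depth(decoding_positions):
    """Return the largest number of VCL NAL units that precede a unit in
    transmission order but follow it in decoding order.

    decoding_positions[k] is the (hypothetical) decoding-order index of
    the k-th VCL NAL unit in transmission order.
    """
    depth = 0
    for k, pos in enumerate(decoding_positions):
        # Count units sent earlier (index < k) that are decoded later.
        ahead = sum(1 for earlier in decoding_positions[:k] if earlier > pos)
        depth = max(depth, ahead)
    return depth


# Illustrative interleaved stream: values are decoding-order indices,
# listed from left to right in transmission order.
example = [0, 2, 1, 3, 5, 4, 6, 8, 7]
depth = interleaving_depth(example)

# Buffer size (in VCL NAL units) that suffices to recover decoding order.
buffer_units = depth + 1
```

For this sequence no unit has more than one earlier-transmitted unit that follows it in decoding order, so a buffer of two VCL NAL units is enough to restore decoding order.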
sprop-deint-buf-req:
   This parameter MUST NOT be present when packetization-mode is not present or the value of packetization-mode is not equal to 2. It MUST be present when the value of packetization-mode is equal to 2.

   sprop-deint-buf-req signals the required size of the de-interleaving buffer for the RTP packet stream. The value of the parameter MUST be greater than or equal to the maximum buffer occupancy (in units of bytes) required in such a de-interleaving buffer that is specified in section 6.2. It is guaranteed that receivers can perform the de-interleaving of interleaved NAL units into NAL unit decoding order, when the de-interleaving buffer size is at least the value of sprop-deint-buf-req in terms of bytes.

   The value of sprop-deint-buf-req MUST be an integer in the range of 0 to 4294967295, inclusive.

      Informative note: sprop-deint-buf-req indicates the required size of the de-interleaving buffer only. When network jitter can occur, an appropriately sized jitter buffer has to be provisioned as well.

deint-buf-cap:
   This parameter signals the capabilities of a receiver implementation and indicates the amount of de-interleaving buffer space in units of bytes that the receiver has available for reconstructing the NAL unit decoding order. A receiver is able to handle any stream for which the value of the sprop-deint-buf-req parameter is smaller than or equal to this parameter.

   If the parameter is not present, then a value of 0 MUST be used for deint-buf-cap. The value of deint-buf-cap MUST be an integer in the range of 0 to 4294967295, inclusive.

      Informative note: deint-buf-cap indicates the maximum possible size of the de-interleaving buffer of the receiver only. When network jitter can occur, an appropriately sized jitter buffer has to be provisioned as well.

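The relationship between sprop-deint-buf-req and deint-buf-cap described above can be sketched as a simple comparison. The following Python fragment is an illustration under stated assumptions: the function name and the use of None to represent an absent deint-buf-cap parameter are hypothetical and not defined by this memo.

```python
UINT32_MAX = 4294967295  # upper bound of both parameters


def can_deinterleave(sprop_deint_buf_req, deint_buf_cap=None):
    """Return True if a receiver advertising deint_buf_cap can handle a
    stream that requires sprop_deint_buf_req bytes of de-interleaving
    buffer. Both values are in units of bytes."""
    if deint_buf_cap is None:
        # An absent deint-buf-cap is treated as the value 0.
        deint_buf_cap = 0
    for value in (sprop_deint_buf_req, deint_buf_cap):
        if not 0 <= value <= UINT32_MAX:
            raise ValueError("value outside the range 0 to 4294967295")
    return sprop_deint_buf_req <= deint_buf_cap
```

Note that this check covers the de-interleaving buffer only; any jitter buffer has to be dimensioned separately, as stated in the informative notes.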
sprop-init-buf-time:
   This parameter MAY be used to signal the properties of an RTP packet stream. The parameter MUST NOT be present if the value of packetization-mode is equal to 1.

   The parameter signals the initial buffering time that a receiver MUST wait before starting decoding to recover the NAL unit decoding order from the transmission order. The parameter is the maximum value of (decoding time of the NAL unit - transmission time of a NAL unit), assuming reliable and instantaneous transmission, the same timeline for transmission and decoding, and that decoding starts when the first packet arrives.

   An example of specifying the value of sprop-init-buf-time follows. A NAL unit stream is sent in the following interleaved order, in which the value corresponds to the decoding time and the transmission order is from left to right:

      0  2  1  3  5  4  6  8  7 ...

   Assuming a steady transmission rate of NAL units, the transmission times are:

      0  1  2  3  4  5  6  7  8 ...

   Subtracting the decoding time from the transmission time column-wise results in the following series:

      0 -1  1  0 -1  1  0 -1  1 ...

   Thus, in terms of intervals of NAL unit transmission times, the value of sprop-init-buf-time in this example is 1. The parameter is coded as a non-negative base-10 integer representation in clock ticks of a 90-kHz clock. If the parameter is not present, then no initial buffering time value is defined. Otherwise the value of sprop-init-buf-time MUST be an integer in the range of 0 to 4294967295, inclusive.

   In addition to the signaled sprop-init-buf-time, receivers SHOULD take into account the transmission delay jitter buffering, including buffering for the delay jitter caused by mixers, translators, gateways, proxies, traffic-shapers, and other network elements.

sprop-max-don-diff:
   This parameter MAY be used to signal the properties of an RTP packet stream.
   It MUST NOT be used to signal transmitter or receiver or codec capabilities. The parameter MUST NOT be present if the value of packetization-mode is equal to 1. sprop-max-don-diff is an integer in the range of 0 to 32767, inclusive. If sprop-max-don-diff is not present, the value of the parameter is unspecified. sprop-max-don-diff is calculated as follows:

      sprop-max-don-diff = max{AbsDON(i) - AbsDON(j)},
      for any i and any j>i,

   where i and j indicate the index of the NAL unit in the transmission order and AbsDON denotes a decoding order number of the NAL unit that does not wrap around to 0 after 65535. In other words, AbsDON is calculated as follows: Let m and n be consecutive NAL units in transmission order. For the very first NAL unit in transmission order (whose index is 0), AbsDON(0) = DON(0). For other NAL units, AbsDON is calculated as follows:

      If DON(m) == DON(n), AbsDON(n) = AbsDON(m)

      If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),
      AbsDON(n) = AbsDON(m) + DON(n) - DON(m)

      If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),
      AbsDON(n) = AbsDON(m) + 65536 - DON(m) + DON(n)

      If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),
      AbsDON(n) = AbsDON(m) - (DON(m) + 65536 - DON(n))

      If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),
      AbsDON(n) = AbsDON(m) - (DON(m) - DON(n))

   where DON(i) is the decoding order number of the NAL unit having index i in the transmission order. The decoding order number is specified in section 4.6.

      Informative note: Receivers may use sprop-max-don-diff to determine which NAL units in the receiver buffer can be passed to the decoder.

max-rcmd-nalu-size: TBD

sar-understood: TBD

sar-supported: TBD

Encoding considerations:
   This type is only defined for transfer via RTP (RFC 3550).

Security considerations:
   See Section 8 of RFC XXXX.
Public specification:
   Please refer to Section 13 of RFC XXXX.

Additional information: None

File extensions: none

Macintosh file type code: none

Object identifier or OID: none

Person & email address to contact for further information:
   Thomas Schierl, ts@thomas-schierl.de

Intended usage: COMMON

Author:
   Thomas Schierl, ts@thomas-schierl.de

Change controller:
   IETF Audio/Video Transport Payloads working group delegated from the IESG.

7.2 SDP Parameters

7.2.1 Mapping of Payload Type Parameters to SDP

TBD

7.2.2 Usage with the SDP Offer/Answer Model

The media type video/H265 string is mapped to fields in the Session Description Protocol (SDP) [RFC4566] as follows:

o  The media name in the "m=" line of SDP MUST be video.

o  The encoding name in the "a=rtpmap" line of SDP MUST be H265 (the media subtype).

o  The clock rate in the "a=rtpmap" line MUST be 90000.

o  The OPTIONAL parameters "profile-level-id" and "packetization-mode", when present, MUST be included in the "a=fmtp" line of SDP. These parameters are expressed as a media type string, in the form of a semicolon-separated list of parameter=value pairs.

o  The OPTIONAL parameters "sprop-parameter-sets" and "sprop-level-parameter-sets", when present, MUST be included in the "a=fmtp" line of SDP or conveyed using the "fmtp" source attribute as specified in section 6.3 of [RFC5576]. For a particular media format (i.e., RTP payload type), a "sprop-parameter-sets" or "sprop-level-parameter-sets" parameter MUST NOT be both included in the "a=fmtp" line of SDP and conveyed using the "fmtp" source attribute. When included in the "a=fmtp" line of SDP, these parameters are expressed as a media type string, in the form of a semicolon-separated list of parameter=value pairs. When conveyed using the "fmtp" source attribute, these parameters are only associated with the given source and payload type as parts of the "fmtp" source attribute.

      Informative note: Conveyance of "sprop-parameter-sets" and "sprop-level-parameter-sets" using the "fmtp" source attribute allows for out-of-band transport of parameter sets in topologies like Topo-Video-switch-MCU [TBD].

An example of media representation in SDP is as follows:

   m=video 49170 RTP/AVP 98
   a=rtpmap:98 H265/90000
   a=fmtp:98 profile-level-id=UVWXYZ; packetization-mode=1;
             sprop-parameter-sets=

7.2.3 Usage with SDP Offer/Answer Model

TBD

7.2.4 Usage in Declarative Session Descriptions

TBD

7.2.5 Signaling of Parallel Processing

TBD

7.3 Examples

TBD

7.4 Parameter Set Considerations

TBD

8. Security Considerations

TBD

9. Congestion Control

TBD

10. IANA Considerations

A new media type, as specified in Section 7.1 of this memo, should be registered with IANA.

11. Informative Appendix: Application Examples

11.1 Introduction

TBD

11.2 Streaming

TBD

11.3 Videoconferencing (Unicast to MANE, Unicast to Endpoints)

TBD

11.4 Mobile TV (Multicast to MANE, Unicast to Endpoint)

TBD

12. Acknowledgements

TBD

This document was prepared using 2-Word-v2.0.template.dot.

13. References

13.1 Normative References

[HEVC]    JCT-VC, "High-Efficiency Video Coding (HEVC) text specification Working Draft 6", JCTVC-H1003, February 2012.

[H.264]   ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services", March 2010.

[RFC6184] Wang, Y.-K., Even, R., Kristensen, T., and R. Jesup, "RTP Payload Format for H.264 Video", RFC 6184, May 2011.

[RFC6190] Wenger, S., Wang, Y.-K., Schierl, T., and A. Eleftheriadis, "RTP Payload Format for Scalable Video Coding", RFC 6190, May 2011.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3264] Rosenberg, J. and H.
Schulzrinne, "An Offer/Answer Model With Session Description Protocol (SDP)", RFC 3264, June 2002.

[RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data Encodings", RFC 4648, October 2006.

[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

[RFC4566] Handley, M., Jacobson, V., and C. Perkins, "SDP: Session Description Protocol", RFC 4566, July 2006.

[RFC5576] Lennox, J., Ott, J., and T. Schierl, "Source-Specific Media Attributes in the Session Description Protocol", RFC 5576, June 2009.

13.2 Informative References

[RFC5109] Li, A., "RTP Payload Format for Generic Forward Error Correction", RFC 5109, December 2007.

14. Authors' Addresses

Thomas Schierl
Fraunhofer HHI
Einsteinufer 37
D-10587 Berlin
Germany
Phone: +49-30-31002-227
EMail: ts@thomas-schierl.de

Stephan Wenger
Vidyo, Inc.
433 Hackensack Ave., 7th floor
Hackensack, N.J. 07601
USA
Phone: +1-415-713-5473
EMail: stewe@stewe.org

Ye-Kui Wang
Qualcomm Incorporated
5775 Morehouse Drive
San Diego, CA 92121
USA
Phone: +1-858-651-8345
EMail: yekuiw@qualcomm.com

Miska M. Hannuksela
Nokia Corporation
P.O. Box 1000
33721 Tampere
Finland
Phone: +358-7180-08000
EMail: miska.hannuksela@nokia.com