Network Working Group                                          T. Davies
Internet-Draft                                                     Cisco
Intended status: Standards Track                        October 19, 2015
Expires: January 7, 2016


             Interpolated reference frames for video coding
                      draft-davies-netvc-irfvc-00

Abstract

   This document describes the use of interpolated reference frames in
   video coding in general, and in the Thor video codec in particular.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Davies                   Expires January 7, 2016                [Page 1]

Internet-Draft                    IRFVC                     October 2015

Table of Contents

   1.  Introduction
   2.  Definitions
     2.1.  Requirements Language
     2.2.  Terminology
   3.  The interpolation process
     3.1.  Interpolation framework
     3.2.  Motion estimation process
     3.3.  Complexity considerations
   4.  Coding using interpolated reference frames
   5.  Compression performance
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Normative References
   Author's Address

1.  Introduction

   This document describes a method of generating synthetic reference
   frames for video coding using a simplified frame interpolation
   method.  The aim is to create a reference frame that is temporally
   co-located with the current frame being predicted, leveraging the
   motion information already present in the previously-coded frames,
   and removing the need for techniques such as motion vector scaling in
   motion vector prediction.

   Since the decoder will have to generate the same interpolated
   reference frame as the encoder, complexity is a paramount concern.
   The interpolation process is therefore a highly simplified
   block-matching algorithm which, for example, uses only pixel-accurate
   motion vectors.  Worst-case complexity can be managed by controlling
   the number of matches per block, per region and per frame, as well as
   the total vertical excursion to manage memory bandwidth.

   The method gives most gain in Thor at high quantisation parameter
   (QP) levels, i.e. at low bitrates.  Overall, Bjontegaard delta-rate
   (BDR) reductions across QP ranges 22-37 are on average 5.2% for a
   range of HD test sequences.  For higher QP (32-44) the reductions are
   larger: 8.8% on average.

   Interpolated reference frames are enabled by default in the high
   complexity random access (RA) and High Delay B (HDB) configurations
   in the Thor repository github.com/cisco/thor.

   Section 3 describes the interpolation process, which is based on a
   simplified hierarchical motion estimation (HME).  Section 4 describes
   the modifications to the Thor syntax coding processes.  Section 5
   provides details of compression performance.

2.  Definitions

2.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.2.  Terminology

   This document frequently uses the following terms.

   MV:  Motion vector - a horizontal and vertical vector displacement
      (x,y)

   ME:  Motion estimation

   HME:  Hierarchical ME

   SAD:  Sum of Absolute Differences.  A metric defined for a pair of
      equal-dimension blocks of numerical values, consisting of the sum
      of the absolute differences of the corresponding values at each
      location in the blocks

   QP:  Quantisation parameter

   BDR:  Bjontegaard Delta-Rate

3.  The interpolation process

3.1.  Interpolation framework

   Consider two frames R0 and R1 and a frame F equidistant in time
   between them which is to be interpolated (Figure 1).  Image data must
   be created for each block in F by combining information from R0 and
   R1 using a linear model for the block motion.
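
   The linear model can be illustrated with a minimal sketch.  It
   assumes that F lies midway between R0 and R1, so that the vector
   into R1 is the negation of the vector into R0, and that a block
   falling outside one reference is replaced by the block from the
   other.  The function names, the list-of-rows frame layout, the
   rounding in the average and the None result when both blocks are
   unavailable are illustrative assumptions, not taken from the Thor
   source.

```python
def fetch_block(frame, x, y, size):
    """Return the size x size block at (x, y), or None if it leaves the frame."""
    height, width = len(frame), len(frame[0])
    if x < 0 or y < 0 or x + size > width or y + size > height:
        return None
    return [row[x:x + size] for row in frame[y:y + size]]


def interpolate_block(r0, r1, x, y, mv0, size=8):
    """Interpolate the block of F at (x, y) from reference frames r0 and r1.

    mv0 = (dx, dy) points into r0; mv1 = -mv0 points into r1, which is
    the linear motion model for a frame F midway between the references.
    """
    dx, dy = mv0
    b0 = fetch_block(r0, x + dx, y + dy, size)
    b1 = fetch_block(r1, x - dx, y - dy, size)
    if b0 is not None and b1 is not None:
        # Both reference blocks exist: average them (rounding is assumed).
        return [[(p + q + 1) >> 1 for p, q in zip(row0, row1)]
                for row0, row1 in zip(b0, b1)]
    # One block falls off the edge: use the other reference only.
    # If both are unavailable this returns None (an assumption).
    return b0 if b0 is not None else b1


# Hypothetical flat 16x16 test frames.
r0 = [[10] * 16 for _ in range(16)]
r1 = [[20] * 16 for _ in range(16)]
centre = interpolate_block(r0, r1, 4, 4, (2, -1))
edge = interpolate_block(r0, r1, 8, 0, (4, 0))
```

   Here the block at (4,4) is the average of both references, while the
   block at (8,0) takes its samples from R1 alone because the displaced
   block in R0 crosses the right-hand frame edge.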

   ______________________________|_______|__________________________ R0
                                 |    /\ |
                                     /
                                    /
                                   / mv0
                                  /
                                 /
   ________________________|____/__|________________________________ F
                           |   /   |
                              /
                             /
                            / mv1
                           /
                          /
   __________________|___\/__|______________________________________ R1
                     |       |

         Figure 1: forward and backward motion pairs for a block

   For each block in the frame F there is an associated motion vector
   mv0 pointing at a displaced block in R0 and a corresponding motion
   vector mv1, equal to -mv0, pointing at R1.  Where F is not
   equidistant from the reference frames the linear model can simply be
   scaled appropriately.

   If both blocks fall within the reference frames, then the
   interpolated block is just the average of the two reference blocks.
   At the edges of the frames one of the reference blocks may fall off
   the edge; in that case only the other reference block is used.

3.2.  Motion estimation process

   Since F does not exist, the motion estimation process consists of
   matching blocks B+mv0 in R0 with blocks B+mv1 in R1.  A basic block
   size of 8x8 is used but the bulk of the motion estimation is done for
   16x16 blocks.  For UHD resolutions, a larger basic block size might
   be better.

   The overall approach is to use hierarchical motion estimation (HME),
   as this is amenable to limiting both average and worst-case
   complexity.

   In the HME scheme each reference frame is down-scaled vertically and
   horizontally by a factor of 2, using a (1/2,1/2) filter.  This is
   done repeatedly to get a series R0(n) and R1(n) of reference frames.
   Then motion estimation is done very simply on each resolution layer
   n, but using candidates from the next layer (n+1) as well as spatial
   neighbours.  The block sizes are the same at each layer, so each
   block at layer n+1 corresponds to 4 blocks at layer n.

   For each layer, the ME stages are as follows:

   1.  For each 16x16 block in raster order:

       a.  Check if ME can be bypassed.

       b.  If not bypassed, determine candidates from lower layer blocks
           and from neighbour blocks in raster order.

       c.  Perform an adaptive cross search around each candidate vector
           and determine the best vector.

   2.  For each 8x8 block in raster order, find the best merge
       candidate, i.e. choose which MV to use: the original 16x16 block
       vector, or one of the 4 neighbouring 16x16 block vectors (above,
       below, left or right).

   The majority of blocks bypass ME at step 1a.  Here a skip candidate
   is generated as:

   skipmv = argmin{mvx in {mv0,mv1,mv2}: sum_{i=0}^{2} |mvx-mvi|}

   where mv0, mv1 and mv2 are the motion vectors for the blocks above,
   left and above-right of the current block.  If the cost for this
   vector is below a fixed value for each 8x8 sub-block, no further ME
   is done.

   In step 1c, the range of the cross search is restricted to just 2
   steps (max 8 matches) for each candidate, if the search is not at the
   lowest resolution layer.  This is because vector candidates from the
   lower layer or from neighbours will already be highly accurate by
   this point.

   In step 1, the cost metric is a combination of luma SAD and a fixed
   multiple of the sum of absolute motion vector differences between the
   vector mvx and the four neighbours mv0, mv1, mv2, mv3 to the left,
   right, above and above right, i.e.

   sum_{i=0}^{3} |mvx-mvi|

   This helps make the motion estimation process less sensitive to noise
   and spurious matches.  In step 2 the cost metric is SAD alone.

3.3.  Complexity considerations

   The ME process is not that sensitive to the selection of candidates,
   at least in terms of the impact on coding performance.  If the
   interpolated frames were used directly this might not be so, but in
   effect the interpolated blocks are only going to be used for
   prediction if they are interpolated well: effort spent refining bad
   matches is therefore generally wasted and should be avoided.
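
   The skip candidate and the step 1 cost metric of Section 3.2 can be
   sketched as follows.  This is a rough illustration only: reading
   |mvx-mvi| as the sum of absolute component differences, the weight
   `lam` and all function names are assumptions made for the example
   and are not taken from the Thor source.

```python
def mv_dist(a, b):
    """|a - b| for two motion vectors, read here as an L1 distance (assumed)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def skip_candidate(neighbours):
    """Return the neighbour vector minimising the summed distance to all
    neighbour vectors, i.e. the argmin in the skipmv expression."""
    return min(neighbours,
               key=lambda mvx: sum(mv_dist(mvx, mvi) for mvi in neighbours))


def block_sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))


def step1_cost(sad, mvx, neighbours, lam=4):
    """Luma SAD plus a fixed multiple (lam, a placeholder value) of the
    summed motion vector differences to the neighbouring vectors."""
    return sad + lam * sum(mv_dist(mvx, mvi) for mvi in neighbours)


# Hypothetical vectors for the blocks above, left and above-right.
skipmv = skip_candidate([(2, 0), (3, 0), (10, 5)])
cost = step1_cost(100, skipmv, [(2, 0), (3, 0), (10, 5), (3, 1)])
```

   With these example vectors the outlier (10,5) is rejected and (3,0)
   is chosen, which shows how the candidate behaves like a vector
   median and why the motion term discourages spurious matches.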

   This means that the ME process can be quite truncated.  The only
   candidates considered are up to three neighbour block candidates and
   one from the layer below.  For the majority of blocks motion
   estimation is skipped, so only a single match is required.

   For HW applications the total number of matches would still require a
   hard limit, as well as limits for the matches per block and possibly
   per region.  Vertical motion vector limits could also be imposed to
   reduce memory bandwidth costs.

4.  Coding using interpolated reference frames

   In the Thor implementation, when an interpolated reference frame is
   used it is inserted at the beginning of the reference pictures list
   and is given the same frame number as the current frame.

   Typically use of the interpolated reference frame causes a
   considerable increase in uni-pred prediction, often with no residual
   to code, and a reduction of bi-prediction modes.  This changes the
   probability of the various supermode values used in Thor.  Therefore
   in such frames it makes sense to modify the supermode coding to
   reflect this, and this contributes a small amount to coding gains.
   Full details are in [Fuld1].

5.  Compression performance

   Luma PSNR BDR percentage gains for standard QP ranges (22,27,32,37)
   are given in Table 1.  For high QP (32,36,40,44), the results are in
   Table 2.

   -------------------------------------------------------
   Sequence                              BDR (%)
   -------------------------------------------------------
   1920x1080
   -------------------------------------------------------
   Kimono                                 -3.5
   ParkScene                              -3.1
   Cactus                                 -4.9
   BasketballDrive                        -2.1
   BQTerrace                              -1.9
   ChangeSeats                            -5.8
   HeadAndShoulder                        -6.6
   TelePresence                           -6.6
   WhiteBoard                             -7.5
   -------------------------------------------------------
   1280x720
   -------------------------------------------------------
   FourPeople                             -7.0
   Johnny                                 -6.2
   KristenAndSara                         -7.0
   -------------------------------------------------------
   Average                                -5.2

               Table 1: BDR reductions for standard QPs

   -------------------------------------------------------
   Sequence                              BDR (%)
   -------------------------------------------------------
   1920x1080
   -------------------------------------------------------
   Kimono                                 -6.6
   ParkScene                              -7.0
   Cactus                                 -8.9
   BasketballDrive                        -5.5
   BQTerrace                              -4.7
   ChangeSeats                           -12.1
   HeadAndShoulder                       -10.1
   TelePresence                          -11.0
   WhiteBoard                            -12.4
   -------------------------------------------------------
   1280x720
   -------------------------------------------------------
   FourPeople                             -9.1
   Johnny                                 -8.0
   KristenAndSara                         -9.9
   -------------------------------------------------------
   Average                                -8.8

                 Table 2: BDR reductions for high QPs

6.  IANA Considerations

   This document has no IANA considerations.

7.  Security Considerations

   This document has no security considerations.

8.  Acknowledgements

   The author would like to thank Arild Fuldseth for assistance with
   experimental investigations, and Mo Zanaty for reviewing this
   document.

9.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [Fuld1]    Fuldseth, A., Bjontegaard, G., and M. Zanaty, "The Thor
              video codec", draft-fuldseth-netvc-thor-01, October 2015.

Author's Address

   Thomas Davies
   Cisco
   Feltham
   UK

   Email: thdavies@cisco.com