Network Working Group                                          T. Davies
Internet-Draft                                                     Cisco
Intended status: Standards Track                        October 19, 2015
Expires: January 7, 2016


              Interpolated reference frames for video coding
                      draft-davies-netvc-irfvc-00

Abstract

   This document describes the use of interpolated reference frames in
   video coding in general, and in the Thor video codec in particular.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Davies                   Expires January 7, 2016                [Page 1]

Internet-Draft                   IRFVC                      October 2015


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Definitions . . . . . . . . . . . . . . . . . . . . . . . . .   3
     2.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
     2.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  The interpolation process . . . . . . . . . . . . . . . . . .   3
     3.1.  Interpolation framework . . . . . . . . . . . . . . . . .   3
     3.2.  Motion estimation process . . . . . . . . . . . . . . . .   4
     3.3.  Complexity considerations . . . . . . . . . . . . . . . .   5
   4.  Coding using interpolated reference frames  . . . . . . . . .   6
   5.  Compression performance . . . . . . . . . . . . . . . . . . .   6
   6. IANA Considerations  . . . . . . . . . . . . . . . . . . . . .   8
   7. Security Considerations  . . . . . . . . . . . . . . . . . . .   8
   8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .   8
   9. Normative References . . . . . . . . . . . . . . . . . . . . .   8
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .   8

1.  Introduction

   This document describes a method of generating synthetic reference
   frames for video coding using a simplified frame interpolation
   method. The aim is to create a reference frame that is temporally
   co-located with the current frame being predicted, leveraging the
   motion information already present in the previously-coded frames,
   and removing the need for techniques such as motion vector scaling
   in motion vector prediction.

   Since the decoder will have to generate the same interpolated
   reference frame as the encoder, complexity considerations are a
   paramount concern. The interpolation process is therefore a highly
   simplified block-matching algorithm and uses only pixel-accurate
   motion vectors, for example. Worst-case complexity can be managed by
   controlling the number of matches per block, per region and per
   frame as well as the total vertical excursion to manage memory
   bandwidth.

   The method gives most gain in Thor at high quantisation (QP) levels
   i.e. low bitrates. Overall, Bjontegaard delta-rate (BDR) reductions
   across QP ranges 22-37 are on average 5.2% for a range of HD test
   sequences.  For higher QP (32-44) the reductions gains are larger:
   8.8% on average.


   Interpolated reference frames are enabled by default in the high
   complexity random access (RA) and High Delay B (HDB) configurations
   in the Thor repository github.com/cisco/thor.


Davies                   Expires January 7, 2016                [Page 2]

Internet-Draft                   IRFVC                      October 2015


   Section 3 describes the interpolation process, which is based on
   a simplified hierarchical motion estimation (HME). Section 4
   describes the modifications to the Thor syntax coding processes.
   Section 5 provides details of compression performance.

2.  Definitions

2.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].


2.2.  Terminology

   This document frequently uses the following terms.

      MV: Motion vector - a horizontal and vertical vector displacement
      (x,y)

      ME: Motion Estimation

      HME: Hierarchical ME

      SAD: Sum of Absolute Differences. A metric defined for a pair of
      equal dimension blocks of numerical vaules consisting of the sum
      of the absolute differences of the corresponding values in each
      location in the blocks

      QP: quantisation parameter

      BDR: Bjontegaard Delta-Rate

3.  The interpolation process

3.1.  Interpolation framework

  Consider two frames R0 and R1 and a frame F equidistant in time
  between them which is to be interpolated (Figure 1). Image data must
  be created for each block in F by combining information from R0 and
  R1 using a linear model for the block motion.


Davies                   Expires January 7, 2016                [Page 3]

Internet-Draft                   IRFVC                      October 2015


    ______________________________|_______|__________________________ R0
                                  |   /\  |
                                      /
                                     /
                                    / mv0
                                   /
                                  /
    ________________________|____/__|________________________________ F
                            |   /   |
                               /
                              /
                             / mv1
                            /
                           /
    __________________|___\/__|______________________________________ R1
                      |       |


            Figure 1: forward and backward motion pairs for a block

   For each block in the frame F there is an associated motion vector
   mv0 pointing at a displaced block in R0 and a corresponding motion
   vector mv1 which is equal to -mv0 pointing at R1.

   Where F is not equidistant from the reference frames the linear model
   can simply be scaled appropriately.

   If both blocks fall within the reference frames, then the
   interpolated block is just the average of the two reference blocks.
   At the edges of the frames one of the reference blocks may fall off
   the edge - here the other reference only is used instead.

3.2.  Motion estimation process

   Since F does not exist the motion estimation process consists of
   matching blocks B+mv0 in R0 with blocks B+mv1 in R1. A basic block
   size of 8x8 is used but the bulk of the motion estimation is done for
   16x16 blocks. For UHD resolutions, perhaps a larger basic block size
   would be better. The overall approach is to use hierarchical motion
   estimation (HME), as this is amenable to limiting both average and
   worst-case complexity.

   In the HME scheme each reference frame is down-scaled vertically and
   horizontally by a factor 2, using a (1/2,1/2) filter. This is done
   repeatedly to get a series R0(n) and R1(n) of reference frames. Then
   motion estimation is done very simply on each resolution layer n, but


Davies                   Expires January 7, 2016                [Page 4]

Internet-Draft                   IRFVC                      October 2015


   using candidates from next layer (n+1) as well as spatial neighbours.
   The block sizes are the same at each layer, so each block at layer
   n+1 corresponds to 4 blocks at layer n.

   For each layer, the ME stages are as follows:

      1.  For each 16x16 block in raster order:
         a.  Check if ME can be bypassed.
         b. If not bypassed, determine candidates from lower layer
	    blocks and from neighbour blocks in raster order
         c. Perform an adaptive cross search around each candidate
	    vector and determine the best vector

      2.  For each 8x8 block in raster order, find the best merge
          candidate, i.e. choose which MV to use: the original 16x16
	  block vector, or one of 4 neighbouring block 16x16 vectors
	  (above, below, left or right)

   The majority of blocks bypass ME at step 1a. Here a skip candidate
   is generated as:

   skipmv = argmin{mvx in {mv0,mv1,mv2}: sum_{i=0}^{2} |mvx-mvi|}

   where mv0,mv1,and mv2 are the motion vectors for blocks above, left
   and above-right the current block. If the cost for this vector is
   below a fixed value for each 8x8 sub-block, no further ME is done.

   In step 1c, the ranges of the cross search are restricted to just 2
   steps (max 8 matches) for each candidate, if the search is not at
   the lowest resolution layer. This is because vector candidates from
   the lower layer or from neighbours will already be highly accurate by
   this point.

   In step 1, the cost metric is a combination of luma SAD and a fixed
   multiple of the sum of abolute motion vector difference between the
   vector mvx and the four neighbours mv0,mv1,mv2,mv3 to the left,
   right, above and above right, i.e.

   sum_{i=0}^{3} |mvx-mvi|

   This helps make the motion estimation process less sensitive to noise
   and spurious matches.

   In step 2 the cost metric is SAD alone.

3.3.  Complexity considerations

   The ME process is not that sensitive to the selection of candidates,


Davies                   Expires January 7, 2016                [Page 5]

Internet-Draft                   IRFVC                      October 2015


   at least in terms of the impact on coding performance. If the
   interpolated frames are used directly this might not be so, but in
   effect the interpolated blocks are only going to be used for
   prediction if they are interpolated well: therefore effort refining
   bad matches is generally wasted, so should be avoided.

   This means that the ME process can be quite truncated. The only
   candidates considered are up to three neighbour block candidates and
   one from the layer below. The majority of motion estimation is
   skipped, and so only requires a single match. For HW applications
   the total number of matches would still require a hard limit, as
   well as limits for the matches per block and possibly per region.
   Vertical motion vector limits could also be imposed to reduce memory
   bandwidth costs.

4.  Coding using interpolated reference frames

   In the Thor implementation, when an interpolated reference frame is
   used it is inserted at the beginning of the reference pictures list
   and is given the same frame number as the current frame. Typically
   use of the interpolated reference frame causes a considerable
   increase in uni-pred prediction, often with no residual to code, and
   a reduction of bi-prediction modes. This changes the probability of
   the various supermode values used in Thor. Therefore in such frames
   it makes sense to modify the supermode coding to reflect this, and
   this contributes a small amount to coding gains. Full details are in
   [Fuld1].

5.  Compression performance

  Luma PSNR BDR percentage gains for standard QP ranges (22,27,32,37)
  are given in Table 1. For high QP (32,36,40,44), the results are in
  Table 2.


Davies                   Expires January 7, 2016                [Page 6]

Internet-Draft                   IRFVC                      October 2015


  -------------------------------------------------------
  1920x1080
  -------------------------------------------------------
  Kimono                                             -3.5
  ParkScene                                          -3.1
  Cactus                                             -4.9
  BasketballDrive                                    -2.1
  BQTerrace                                          -1.9
  ChangeSeats                                        -5.8
  HeaAndShoulder                                     -6.6
  TelePresence                                       -6.6
  WhiteBoard                                         -7.5
  -------------------------------------------------------
  1280x720
  -------------------------------------------------------
  FourPeople                                         -7.0
  Johnny                                             -6.2
  KristenAndSara                                     -7.0
  -------------------------------------------------------
  Average                                            -5.2

        Table 1: BDR reductions for standard QPs

  -------------------------------------------------------
  1920x1080
  -------------------------------------------------------
  Kimono                                             -6.6
  ParkScene                                          -7.0
  Cactus                                             -8.9
  BasketballDrive                                    -5.5
  BQTerrace                                          -4.7
  ChangeSeats                                       -12.1
  HeaAndShoulder                                    -10.1
  TelePresence                                      -11.0
  WhiteBoard                                        -12.4
  -------------------------------------------------------
  1280x720
  -------------------------------------------------------
  FourPeople                                         -9.1
  Johnny                                             -8.0
  KristenAndSara                                     -9.9
  -------------------------------------------------------
  Average                                            -8.8

        Table 2: BDR reductions for high QPs


Davies                   Expires January 7, 2016                [Page 7]

Internet-Draft                   IRFVC                      October 2015


6.  IANA Considerations

   This document has no IANA considerations.

7.  Security Considerations

   This document has no security considerations.

8.  Acknowledgements

   The author would like to thank Arild Fuldseth for assistance with
   experimental investigations, and Mo Zanaty for reviewing this
   document.


9.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [Fuld1]    Fuldseth, A., Bjontegaard, G., Zanaty, M. "The Thor video
              codec", draft-fuldseth-netvc-thor-01, October 2015.

Authors' Addresses

   Thomas Davies
   Cisco
   Feltham
   UK

   Email: thdavies@cisco.com


Davies                   Expires January 7, 2016                [Page 8]

Internet-Draft                   IRFVC                      October 2015