briscoe-iccrg-prague-congestion-control-03.txt

Internet DRAFT - draft-briscoe-iccrg-prague-congestion-control
draft-briscoe-iccrg-prague-congestion-control

Last Version:	draft-briscoe-iccrg-prague-congestion-control-03.txt	Tracker Entry
Date:	`14-Oct-2023`
Disposition:	current
Previous Versions:	draft-briscoe-iccrg-prague-congestion-control-02.txt (diff) - 11-Mar-2023
	draft-briscoe-iccrg-prague-congestion-control-01.txt (diff) - 12-Jul-2022
	draft-briscoe-iccrg-prague-congestion-control-00.txt (diff) - 10-Mar-2021




Internet Congestion Control Research Group (ICCRG)        K. De Schepper
Internet-Draft                                                O. Tilmans
Intended status: Experimental                            Nokia Bell Labs
Expires: 16 April 2024                                   B. Briscoe, Ed.
                                                             Independent
                                                                 V. Goel
                                                               Apple Inc
                                                         14 October 2023


                       Prague Congestion Control
            draft-briscoe-iccrg-prague-congestion-control-03

Abstract

   This specification defines the Prague congestion control scheme,
   which is derived from DCTCP and adapted for Internet traffic by
   implementing the Prague L4S requirements.  Over paths with L4S
   support at the bottleneck, it adapts the DCTCP mechanisms to achieve
   consistently low latency and full throughput.  It is defined
   independently of any particular transport protocol or operating
   system, but notes are added that highlight issues specific to certain
   transports and OSs.  It is mainly based on experience with the
   reference Linux implementation of TCP Prague and the Apple
   implementation over QUIC, but it includes experience from other
   implementations where available.

   The implementation does not satisfy all the Prague requirements (yet)
   and the IETF might decide that certain requirements need to be
   relaxed as an outcome of the process of trying to satisfy them all.
   Future plans that have typically only been implemented as proof-of-
   concept code are outlined in a separate section.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."




De Schepper, et al.       Expires 16 April 2024                 [Page 1]

Internet-Draft          Prague Congestion Control           October 2023


   This Internet-Draft will expire on 16 April 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Motivation: Low Queuing Delay /and/ Full Throughput . . .   4
     1.2.  Document Purpose  . . . . . . . . . . . . . . . . . . . .   5
     1.3.  Maturity Status (To be Removed Before Publication)  . . .   6
     1.4.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   6
   2.  Prague Congestion Control . . . . . . . . . . . . . . . . . .   8
     2.1.  The Prague L4S Requirements . . . . . . . . . . . . . . .   9
     2.2.  Packet Identification . . . . . . . . . . . . . . . . . .  11
     2.3.  Detecting and Measuring Congestion  . . . . . . . . . . .  11
       2.3.1.  Accurate ECN Feedback . . . . . . . . . . . . . . . .  11
         2.3.1.1.  Accurate ECN Feedback with TCP & Derivatives  . .  11
         2.3.1.2.  Accurate ECN Feedback with Other Modern
                 Transports  . . . . . . . . . . . . . . . . . . . .  12
       2.3.2.  Moving Average of ECN Feedback  . . . . . . . . . . .  13
       2.3.3.  Scaling Loss Detection with Flow Rate . . . . . . . .  14
     2.4.  Congestion Response Algorithm . . . . . . . . . . . . . .  15
       2.4.1.  Loss behaviour  . . . . . . . . . . . . . . . . . . .  15
       2.4.2.  Multiplicative Decrease on ECN Feedback . . . . . . .  16
       2.4.3.  Additive Increase and ECN Feedback  . . . . . . . . .  17
       2.4.4.  Reduced RTT-Dependence  . . . . . . . . . . . . . . .  18
       2.4.5.  Flow Start or Restart . . . . . . . . . . . . . . . .  20
     2.5.  Packet Sending  . . . . . . . . . . . . . . . . . . . . .  20
       2.5.1.  Packet Pacing . . . . . . . . . . . . . . . . . . . .  20
       2.5.2.  Segmentation Offload  . . . . . . . . . . . . . . . .  21
   3.  Variants and Future Work  . . . . . . . . . . . . . . . . . .  21
     3.1.  Getting up to Speed Faster  . . . . . . . . . . . . . . .  21
       3.1.1.  Flow Start (or Restart) . . . . . . . . . . . . . . .  22
       3.1.2.  Faster than Additive Increase . . . . . . . . . . . .  23
       3.1.3.  Remove Lag in Congestion Response . . . . . . . . . .  24
     3.2.  Combining Congestion Metrics  . . . . . . . . . . . . . .  25



De Schepper, et al.       Expires 16 April 2024                 [Page 2]

Internet-Draft          Prague Congestion Control           October 2023


       3.2.1.  ECN with Loss . . . . . . . . . . . . . . . . . . . .  25
       3.2.2.  ECN with Delay  . . . . . . . . . . . . . . . . . . .  26
     3.3.  Fall-Back on Classic ECN  . . . . . . . . . . . . . . . .  26
     3.4.  Further Reduced RTT-Dependence  . . . . . . . . . . . . .  27
     3.5.  Scaling Down to Fractional Windows  . . . . . . . . . . .  27
   4.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  28
   5.  Security Considerations . . . . . . . . . . . . . . . . . . .  28
   6.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  28
     6.1.  Normative References  . . . . . . . . . . . . . . . . . .  28
     6.2.  Informative References  . . . . . . . . . . . . . . . . .  29
   Comments and Contributions Solicited  . . . . . . . . . . . . . .  32
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .  32
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  33

1.  Introduction

   This document defines the Prague Congestion Control Algorithm (CCA),
   which is intended to maintain consistently low queuing delay over
   network paths that offer Low Latency, Low Loss, and Scalable
   throughput (L4S) support at the bottleneck [RFC9330].  Where the
   bottleneck does not support L4S, a Prague CCA is intended to fall
   back to behaving like a conventional 'Classic' congestion control.
   L4S support in the network involves Active Queue Management (AQM)
   that applies 'Immediate Explicit Congestion Notification (ECN)' at a
   very shallow target queueing delay (of the order of a millisecond) or
   even a virtual queue with no queuing delay, for example [RFC9332].
   'Immediate ECN' means that the network applies ECN marking based on
   the instantaneous queue, without any smoothing or filtering, The
   Prague CCA takes on the job of smoothing and filtering the congestion
   signals from the network.

   The Prague CCA is a particular instance of a scalable congestion
   control, which is defined in Section 1.4.  Scalable congestion
   control is the part of the L4S architecture that does the actual work
   of maintaining low queuing delay and ensuring that the delay and
   throughput properties scale with flow rate.

   The L4S architecture [RFC9330] places the host congestion control in
   the context of the other parts of the system.  In particular the
   different types of L4S AQM in the network and the codepoints in the
   IP-ECN field that convey to the network that the host supports the
   L4S form of ECN.  The architecture document also covers other issues
   such as: incremental deployment; protection of low latency queues
   against accidental or malicious disruption; and the relationship of
   L4S to other low latency technologies.  The specification of the L4S
   ECN Protocol [RFC9331] sets down the requirements that a Prague CCA
   has to follow (called the Prague L4S Requirements - see Section 2.1
   for a summary).



De Schepper, et al.       Expires 16 April 2024                 [Page 3]

Internet-Draft          Prague Congestion Control           October 2023


   This specification defines the Prague CCA independent of any
   particular transport protocol or operating system, but notes are
   added that highlight issues specific to certain transports and OSs.
   These primarily draw on experience with i) the reference
   implementation of Prague on Linux over TCP, and ii) the Apple
   implementation of Prague over QUIC.  Nonetheless, wherever possible,
   experience from implementers on other platforms is included.

   Links to implementations of the Prague CCA and other scalable
   congestion controls can be found via the L4S landing page [L4S-home],
   which also links to numerous other L4S-related resources.  A
   (slightly dated) paper on the specific implementation of the Prague
   CCA in Linux over TCP is also available [PragueLinux], and the open-
   source code for Linux is at [linux-code].

1.1.  Motivation: Low Queuing Delay /and/ Full Throughput

   A Prague CCA is capable of keeping queuing delay consistently low
   while fully utilizing available capacity.  In contrast, Classic
   congestion controls need to induce a reasonably large queue
   (approaching a bandwidth-delay product) in order to fully utilize
   capacity.  Therefore, prior to scalable CCAs like DCTCP and Prague,
   it was believed that very low delay was only possible by limiting
   throughput and isolating the low delay traffic from capacity-seeking
   traffic.

   A Prague CCA uses additive increase multiplicative decrease (AIMD),
   in which it increases its window until an ECN mark (or loss) is
   detected, then yields in a continual sawtooth pattern.  The key to
   keeping queuing delay low without under-utilizing capacity is to keep
   the sawteeth tiny.  This is achieved by ensuring that the reduction
   in rate that starts each sawtooth is proportionate to the prevailing
   level of congestion.  In contrast, a classical CCA merely responds to
   the existence of congestion, not its extent.  So each classical
   reduction has to be large enough to cope with the worst case.

   For example the average duration of a Prague CCA sawtooth is of the
   order of a round trip, whereas a classic congestion control sawtooths
   over hundreds of round trips, implying multiple seconds.  For
   instance, at 120 Mb/s and with a maximum RTT of 30 ms at the peak of
   each sawtooth, CUBIC takes 4.3 s to recover from each sawtooth
   reduction.  At this rate, CUBIC is still fully in its Reno-friendly
   mode.  If flow rate scales by 8 times to 960 Mb/s, it enters true
   CUBIC mode, with a recovery time of 12.2 s.  From then on, each
   further scaling by 8 times doubles CUBIC's recovery time (because the
   cube root of 8 is 2), e.g., at 7.68 Gb/s, the recovery time is 24.3
   s.




De Schepper, et al.       Expires 16 April 2024                 [Page 4]

Internet-Draft          Prague Congestion Control           October 2023


   Keeping the sawtooth amplitude down keeps queue variation down and
   utilization up.  Keeping the duration of the sawteeth down ensures
   control remains tight.  The definition of a scalable CCA is that the
   duration between congestion marks does not increase as flow rate
   scales, all other factors being equal.  This is important, because it
   means that the sawteeth will always stay tiny.  So queue delay will
   remain very low, and control will remain very tight.

   The tip of each sawtooth occurs when the bottleneck link emits a
   congestion signal.  Therefore such small sawteeth are only feasible
   when ECN is used for the congestion signals.  If loss were used after
   each brief recovery time, the loss level would be prohibitively high.
   This is why L4S-ECN has to depart from the requirement of Classic
   ECN[RFC3168] that an ECN mark is equivalent to a loss.

   The Prague CCA is derived from Data Center TCP (DCTCP [RFC8257]).
   DCTCP is confined to controlled environments like data centres
   precisely because it uses such small sawteeth, which induce such a
   high level of congestion marking.  For a CCA using Classic ECN, this
   would be interpreted as equivalent to the same, very high, loss
   level.  The Classic CCA would then continually drive its own rate
   down in the face of such an apparently high level of congestion.

   This is why coexistence with existing traffic is important for the
   Prague CCA.  It has to be able to detect whether it is sharing a
   Classic bottleneck with Classic traffic, and if so fall back to
   behaving in a Classic way.  If the bottleneck does not support ECN at
   all, that is easy - a Prague CCA just responds in the Classic way to
   loss (see Section 2.4.1).  But if it is sharing a Classic ECN
   bottleneck with Classic ECN traffic, this is more difficult to detect
   (see Section 3.3).  Because the Prague CCA removes most of the queue,
   it also has to address RTT-dependence.  Otherwise, at low base RTTs,
   its flow rate would become far more RTT-dependent than Classic CCAs.

1.2.  Document Purpose

   There is not 'One True Prague CCA'.  L4S is intended to enable
   development of any scalable CCA that meets the L4S Prague
   requirements [RFC9331].  This document attempts to describe a design
   that transfers a byte stream.  It is generalized across different
   transports and OS platforms.










De Schepper, et al.       Expires 16 April 2024                 [Page 5]

Internet-Draft          Prague Congestion Control           October 2023


1.3.  Maturity Status (To be Removed Before Publication)

   The field of congestion control is always a work in progress.
   However, there are areas of Prague implementations that are still
   just placeholders while separate research code is evaluated.  And in
   other implementations of the Prague CCA, other areas are incomplete.
   In the Linux reference implementation of TCP Prague, interim code is
   used in the incomplete areas, which are:

   *  Flow start and restart (standard slow start is used, even though
      it often exits early in L4S environments were ECN marking tends to
      be frequent);

   *  Faster than additive increase (standard Reno additive increase is
      used, which makes the flow particularly sluggish if it has dropped
      out of slow start early).

   The body of this document describes the Prague CCA as implemented.
   Any non-default options or any planned improvements are separated out
   into Section 3 on "Variants and Future Work".  As each of the above
   areas is addressed, it will will be removed from this section and its
   description in the body of the document will be updated.  Once all
   areas are complete, this section will be removed.  Prague will then
   still be a work in progress, but only on a similar footing as all
   other congestion controls.

   No implementation satisfies all the Prague requirements (yet), and
   the IETF might decide that certain requirements need to be relaxed as
   an outcome of the process of trying to satisfy them all.

1.4.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   Definitions of terms:

   Classic Congestion Control:  A congestion control behaviour that can










De Schepper, et al.       Expires 16 April 2024                 [Page 6]

Internet-Draft          Prague Congestion Control           October 2023


      co-exist with standard TCP Reno [RFC5681] without causing
      significantly negative impact on its flow rate [RFC5033].  With
      Classic congestion controls, as flow rate scales, the number of
      round trips between congestion signals (losses or ECN marks) rises
      with the flow rate.  So it takes longer and longer to recover
      after each congestion event.  Therefore control of queuing and
      utilization becomes very slack, and the slightest disturbance
      prevents a high rate from being attained [RFC3649].

   Scalable Congestion Control:  A CCA where the average time from one
      congestion signal to the next (the recovery time) remains
      invariant as the flow rate scales, all other factors being equal.
      This maintains the same degree of control over queueing and
      utilization whatever the flow rate, as well as ensuring that high
      throughput is robust to disturbances.  For instance, DCTCP
      averages 2 congestion signals per round-trip whatever the flow
      rate.  For the public Internet a Scalable CCA has to comply with
      the requirements in Section 4 of [RFC9331] (aka.  the 'Prague L4S
      requirements').

   Response function:  The relationship in steady state between the
      window (cwnd) of a congestion control and the congestion
      signalling probability, p, [RFC3649].  A widely used family of
      response functions has the form cwnd = K/p^B, where K and B are
      constants.  In an approximation of the response function of the
      standard Reno CCA, B=1/2.  For a scalable congestion control B=1,
      so its response function takes the form cwnd = K/p.  The number of
      congestion signals per round is p*cwnd, which equates to the
      constant, K, for a scalable CCA.  Hence the definition of a
      scalable CCA above.

   Reno-friendly:  The subset of Classic traffic that excludes
      unresponsive traffic and excludes experimental congestion controls
      intended to coexist with Reno but without always being strictly
      friendly to it (as allowed by [RFC5033]).  Reno-friendly is used
      in place of 'TCP-friendly', given that the TCP protocol is used
      with many different congestion control behaviours.

   Classic ECN:  The original Explicit Congestion Notification (ECN)
      protocol [RFC3168], which requires ECN signals to be treated the
      same as drops, both when generated in the network and when
      responded to by the sender.

      The names used for the four codepoints of the 2-bit IP-ECN field
      are as defined in [RFC3168]: Not ECT, ECT(0), ECT(1) and CE, where
      ECT stands for ECN-Capable Transport and CE stands for Congestion
      Experienced.




De Schepper, et al.       Expires 16 April 2024                 [Page 7]

Internet-Draft          Prague Congestion Control           October 2023


      A packet marked with the CE codepoint is termed 'ECN-marked' or
      sometimes just 'marked' where the context makes ECN obvious.

   CC:  Congestion Control

   CCA:  Congestion Control Algorithm

   ACK:  an ACKnowledgement, or to ACKnowledge

   EWMA:  Exponentially Weighted Moving Average

   RTT:  Round Trip Time

   Definitions of Parameters and Variables:

   MTU_BITS:  Maximum transmission unit [b]

   cwnd:  Congestion window [B]

   ssthresh:  Slow start threshold [B]

   inflight:  The amount of data that the sender has sent but not yet
      received ACKs for [B]

   p:  Steady-state probability of drop or marking []

   alpha:  EWMA of the ECN marking fraction []

   acked_sacked:  the amount of new data acknowledged by an ACK [B]

   ece_delta:  the amount of newly acknowledged data that was ECN-marked
      [B]

   ai_per_rtt:  additive increase to apply per RTT [B]

   srtt:  Smoothed round trip time [s]

   MAX_BURST_DELAY:  Maximum allowed bottleneck queuing delay due to
      segmentation offload bursts [s] (default 0.25 ms for the public
      Internet)

2.  Prague Congestion Control









De Schepper, et al.       Expires 16 April 2024                 [Page 8]

Internet-Draft          Prague Congestion Control           October 2023


2.1.  The Prague L4S Requirements

   The beneficial properties of L4S traffic (low queuing delay, etc.)
   depend on all L4S sources satisfying a set of conditions called the
   Prague L4S Requirements.  The name is after an ad hoc meeting of
   about thirty people co-located with the IETF in Prague in July 2015,
   the day after the first public demonstration of L4S.

   The meeting agreed on a list of modifications to DCTCP [RFC8257] to
   focus activity on a variant that would be safe to use over the public
   Internet.  It was suggested that this could be called TCP Prague to
   distinguish it from DCTCP.  This list was adopted by the IETF, and
   has continued to evolve (see section 4 of [RFC9331]).  The
   requirements are no longer TCP-specific, applying irrespective of
   wire-protocol (TCP, QUIC, RTP, SCTP, etc).

   This unusual start to the life of the project led to the unusual
   development process of a reference implementation that had to resolve
   a number of ambitious requirements, already known to be in tension
   [Tensions17].

   DCTCP already implements a scalable congestion control.  So most of
   the changes to make it usable over the Internet seemed trivial, some
   'merely' involving adoption of other parallel developments like
   Accurate ECN TCP feedback [I-D.ietf-tcpm-accurate-ecn] or RACK
   [RFC8985].  Others have been more challenging (e.g.  RTT-
   independence).  And others that seemed trivial became challenging
   given the complex set of bugs and behaviours that characterize
   today's Internet and a modern end system stack such as Linux.

   The more critical implementation challenges are highlighted in the
   following sections, in the hope we can prevent mistakes being
   repeated (see for instance Section 2.3.2, Section 2.4.2).  There was
   also a set of five intertwined 'bugs' - all masking each other, but
   causing unpredictable or poor performance as different code
   modifications or environments unmasked them.  A comprehensive write-
   up of the experiments that had to be designed to isolate these bugs
   is available separately [DCTCP_flaws].













De Schepper, et al.       Expires 16 April 2024                 [Page 9]

Internet-Draft          Prague Congestion Control           October 2023


   During the development process, we have unearthed fundamental aspects
   of the implementation and indeed the design of DCTCP and Prague that
   have still not caught up with the paradigm shift from existence to
   extent-based congestion response.  Some have been implemented by
   default, e.g. not suppressing additive increase for a round trip
   after a congestion event (Section 2.4.3).  Others have been
   implemented but not fully evaluated, e.g. removing the 1-2
   unnecessary round trips of lag in feedback processing (Section 3.1.3)
   and yet others are still future plans, e.g. further RTT-independence
   (Section 3.4) and exploiting combined congestion metrics in more
   cases (Section 3.2).

   The requirements are categorized into those that would impact other
   flows if not handled properly and performance optimizations that are
   important but optional from the IETF's point of view, because they
   only affect the flow itself.  The list below maps the order of the
   requirements in [RFC9331] to the order in this document (which is by
   functional categories and code status):

   Mandatory or Advisory Requirements:
      *  L4S-ECN packet identification: use of ECT(1) (Section 2.2)

      *  Accurate ECN feedback (Section 2.3.1)

      *  Reno-friendly response to a loss (Section 2.4.1)

      *  Detection of a classic ECN AQM (Section 3.3)

      *  Reduced RTT dependence (Section 2.4.4)

      *  Scaling down to a fractional window (recommended but no longer
         mandatory, see Section 3.5)

      *  Detecting loss in units of time (Section 2.3.3)

      *  Minimizing bursts (Section 2.5.1

   Optional performance optimizations:
      *  ECN-capable control packets (Section 2.2)

      *  Faster flow start (Section 3.1.1)

      *  Faster than additive increase (Section 3.1.2)

      *  Segmentation offload (Section 2.5.2)






De Schepper, et al.       Expires 16 April 2024                [Page 10]

Internet-Draft          Prague Congestion Control           October 2023


2.2.  Packet Identification

   On the public Internet, a sender using a Prague CCA MUST set the
   ECT(1) codepoint on all the packets it sends, in order to identify
   itself as an L4S-capable congestion control (Req 4.1. [RFC9331]).

   This applies whatever the transport protocol, whether TCP, QUIC, RTP,
   etc.  In the case of TCP, unlike an RFC 3168 TCP ECN transport, a
   sender can set all packets as ECN-capable, including TCP control
   packets and retransmissions [RFC8311],
   [I-D.ietf-tcpm-generalized-ecn].

   A Prague CCA SHOULD optionally be configurable to use the ECT(0)
   codepoint in private networks, such as data centres, which might be
   necessary for backward compatibility with DCTCP deployments where
   ECT(1) might already have another usage.

   Linux implementation note:

   TCP Prague in Linux kernel:  The Linux kernel was updated to allow
      the ECT(1) flag to be set from within a CCA module.  A Prague CCA
      then has full control over the ECN code point it uses at any one
      time.  In this way it asserts the use of ECT(1) (or optionally
      ECT(0)) and non-ECT when required.

2.3.  Detecting and Measuring Congestion


2.3.1.  Accurate ECN Feedback

   When feedback of ECN markings was added to TCP [RFC3168], it was
   decided not to report any more than one mark per RTT.  L4S-capable
   congestion controls need to know the extent, not just the existence
   of congestion (Req 4.2. [RFC9331]).  Recently defined transports
   (DCCP, QUIC, etc) typically already satisfy this requirement.  So
   they are dealt with separately below, while TCP and derivatives such
   as SCTP [RFC9260] are covered first.

2.3.1.1.  Accurate ECN Feedback with TCP & Derivatives

   The TCP wire protocol is being updated to allow more accurate
   feedback (AccECN [I-D.ietf-tcpm-accurate-ecn]).  Therefore, in the
   case where a sender uses a Prague CCA over TCP, whether as client or
   server:

   *  it MUST itself support AccECN;





De Schepper, et al.       Expires 16 April 2024                [Page 11]

Internet-Draft          Prague Congestion Control           October 2023


   *  to support AccECN it also has to check that its peer supports
      AccECN during the handshake.

   If the peer does not support accurate ECN feedback, the sender MUST
   fall back to a Reno-friendly CCA behaviour for the rest of the
   connection.  The non-Prague TCP sender MUST then no longer set ECT(1)
   on the packets it sends.  Note that the peer only needs to support
   AccECN; there is no need to find out whether the peer is using an
   L4S-capable congestion control.

   Note that a sending TCP client that uses a Prague CCA can set ECT(1)
   on the SYN prior to checking whether the other peer supports AccECN
   (as long as it follows the procedure in
   [I-D.ietf-tcpm-generalized-ecn] if it subsequently discovers the peer
   does not support AccECN).

   Linux implementation note:

   TCP Prague in Linux kernel:  The Linux kernel update to support
      AccECN for TCP is independent of the CCA module in use.  So the
      kernel tries to negotiate AccECN feedback whichever congestion
      control module is selected.  The Prague CCA module does an
      additional check to verify that the kernel actually does support
      AccECN, based on which Prague will either decide to proceed using
      a scalable CCA or fall back to a Classic CCA (Reno in the current
      Linux implementation).

      A system-wide option is available to enable or disable AccECN
      negotiation.  However, TCP flows using the Prague CCA module
      depend on AccECN; so they always ignore this system-wide sysctl
      and enable AccECN negotiation anyway.

2.3.1.2.  Accurate ECN Feedback with Other Modern Transports

   Transport protocols specified recently, .e.g.  DCCP [RFC4340], QUIC
   [RFC9000], are unambiguously suitable for Prague CCAs, at least
   according to the protocol specs, because they were designed from the
   start with accurate ECN feedback.  However, in practice, ECN feedback
   was omitted from the early versions of some QUIC implementations.

   At the start of a connection, a QUIC sender that wishes to use ECN
   checks incoming feedback for whether ECN correctly traversed the
   forward path.  To such a QUIC sender, a peer that has not implemented
   ECN feedback is indistinguishable from a network path that bleaches
   the ECN field.  So, in either case, the sender will disable the use
   of ECN.





De Schepper, et al.       Expires 16 April 2024                [Page 12]

Internet-Draft          Prague Congestion Control           October 2023


   Without ECN, Prague essentially turns into a Classic CCA.  For
   instance, the Apple implementation of Prague for QUIC responds to
   each loss as CUBIC would.  So without ECN, its behaviour becomes
   indistinguishable from CUBIC.  This is because Prague's response to
   each loss is Reno-friendly (as required in Section 2.4.1).

   In the case of RTP/RTCP, ECN feedback was added in [RFC6679], which
   is sufficient for a Prague CCA.  However, it is preferable to use the
   most recent improvements to ECN feedback in [RFC8888], as used in the
   implementation of the L4S variant of SCReAM [RFC8298].

2.3.2.  Moving Average of ECN Feedback

   A Prague CCA currently maintains a moving average of ECN feedback in
   a similar way to DCTCP.  This section is provided mainly because
   performance has proved to be sensitive to implementation precision in
   this area.  So first, some background is necessary.

   A Prague CCA triggers update of its moving average once per rtt_virt
   (see Section 2.4.4).  To maintain its moving average, it measures the
   fraction, frac, of ACKed packets {ToDo: add Future Work on using
   ACKed bytes instead, if decided appropriate} that carried ECN
   feedback over the previous rtt_virt.  It then updates an
   exponentially weighted moving average (EWMA) of this fraction, called
   alpha, using the following algorithm:

      alpha += g * (frac - alpha);

   where g (default 1/16) is the gain of the EWMA.

   The moving average, alpha, is initialized to 1 at the first sign of
   ECN feedback, which ensures the maximum congestion response to the
   first appearance of congestion at a bottleneck supporting ECN.

   Linux implementation notes:

   Rounding problems in DCTCP:  Alpha is a fraction between 0 and 1, and
      it needs to be represented with high resolution because the larger
      the bandwidth-delay product (BDP) of a flow, the smaller the value
      that alpha converges to (in steady state alpha = 2/cwnd).  In
      principle, Linux DCTCP maintains the moving average 'alpha' using
      the same formula as Prague CCA uses (as above).  Linux DCTCP
      represents alpha with a 10-bit integer (with resolution 1/1024).
      However, up to kernel release 3.19, Linux used integer arithmetic
      that could not reduce alpha below 15/1024.  Then it was patched so
      that any value below 16/1024 was rounded down to zero
      [patch-alpha-zero].  For a flow with a higher BDP than 128
      segments, this means that, alpha flip-flops.  Once it has flopped



De Schepper, et al.       Expires 16 April 2024                [Page 13]

Internet-Draft          Prague Congestion Control           October 2023


      down to zero DCTCP becomes unresponsive until it has built
      sufficient queue to flip up to 16/1024.  For larger BDPs, this
      causes DCTCP to induce larger sawteeth, which loses the low-
      queuing-delay and high-utilization intent of the algorithm.

   Upscaled alpha in Prague CC:  To resolve the above problem the
      implementation of TCP Prague in Linux maintains upscaled_alpha =
      alpha/g instead of alpha:

         upscaled_alpha += frac - g * upscaled_alpha;

      This technique is the same as Linux uses for the retransmission
      timer variables, srtt and mdev.  The Linux Prague CCA also uses 20
      bits for alpha.

   The above per-rtt_virt update to the moving average is currently used
   in the Linux Prague CCA.  However, another approach is being
   investigated because these per-RTT updates introduce 1--2 rounds of
   delay into the congestion response on top of the inherent round of
   feedback delay (see Section 3.1.3 in the section on variants and
   future work).

2.3.3.  Scaling Loss Detection with Flow Rate

   After an ACK leaves a gap in the sequence space, a Prague CCA is
   meant to deem that a loss has occurred using 'time-based units' (Req
   4.3. [RFC9331]).  This is in contrast to the traditional approach
   that counts a hard-coded number of duplicate ACKs, e.g. the 3 Dup-
   ACKs specified in [RFC5681].  The reason is that counting packets
   rather than time unnecessarily tightens the time within which
   parallelized links have to keep packets in sequence as flow rate
   scales over the years.

   To satsify this requirement, a Prague CCA SHOULD wait until a certain
   fraction of the RTT has elapsed before it deems that the gap is due
   to packet loss.  The reference implementation of TCP Prague in Linux
   uses RACK [RFC8985] to address this requirement.  An approach similar
   to TCP RACK is also used in QUIC.













De Schepper, et al.       Expires 16 April 2024                [Page 14]

Internet-Draft          Prague Congestion Control           October 2023


   At the start of a connection, RACK counts 3 DupACKs to detect loss
   because the initial smoothed RTT estimate can be inaccurate.  This
   would depend indirectly on time as long as the initial window (IW) is
   paced over a round trip (see Section 2.4.5).  For instance, if the
   initial window of 10 segments was paced evenly across the initial RTT
   then, in the next round, an implementation that deems there has been
   a loss after (say) 1/4 of an RTT can count 1/4 of 10 = 3 DupACKs
   (rounded up).  In subsequent rounds, if the TCP connection detects
   reordering, RACK shifts to using a fraction of the RTT for loss
   detection.

2.4.  Congestion Response Algorithm

   In congestion avoidance phase, a Prague CCA uses a similar additive
   increase multiplicative decrease (AIMD) algorithm to DCTCP, but with
   the following differences:

2.4.1.  Loss behaviour

   A Prague CCA MUST use a Reno-friendly congestion response (such as
   that of CUBIC [RFC9438] or Reno [RFC5681]) on detection of a loss
   (Req 2 in section 4.3. of [RFC9331]).  DCTCP falls back to Reno for
   the round trip after a loss, and the Linux reference implementation
   of TCP Prague currently inherits this behaviour.  On detection of
   loss, the Apple implementation of Prague uses CUBIC's behaviour
   instead of Reno's, for both the reduction after the loss and the
   subsequent growth of cwnd until the next congestion event.

   If a Prague CCA has already reduced the congestion window due to ECN
   feedback less than a round trip before it detects a loss, it MAY
   reduce the congestion window by a smaller amount due to the loss, as
   long as the reductions due to ECN and the loss, when combined, result
   in a reduction that is at least as large as the implementation
   usually makes in response to loss (e.g. 50% to emulate Reno or 30% to
   emulate CUBIC).

   See Section 3.2 for discussion of future work on congestion control
   using a combination of delay, ECN and loss.

   Linux implementation note:

   DCTCP bug prior to v5.1:  A Prague CCA cannot rely on inheriting the
      fall-back-on-loss behaviour of the DCTCP code in the Linux kernel
      prior to v5.1, due to a previous bug in the fast retransmit code
      (but not in the retransmission timeout code) [patch-loss-react].






De Schepper, et al.       Expires 16 April 2024                [Page 15]

Internet-Draft          Prague Congestion Control           October 2023


2.4.2.  Multiplicative Decrease on ECN Feedback

   A Prague CCA currently responds to ECN feedback in a similar way to
   DCTCP.  This section is provided mainly because performance has
   proved to be sensitive to implementation details in this area.  So
   the following recap of the congestion response is needed first.

   As explained in Section 2.3.2, like DCTCP, the Linux Prague CCA
   clocks its moving average of ECN-marking, alpha, once per rtt_virt
   throughout a connection.  Nonetheless, it only triggers a
   multiplicative decrease to its congestion window when it actually
   receives an ACK carrying ECN feedback.  Then it suppresses any
   further decreases for one round trip, even if it receives further ECN
   feedback.  This is termed Congestion Window Reduced or CWR state.

   A Prague CCA (also like DCTCP) ensures that the average recovery time
   remains invariant as flow rate scales (Req 4.3 of [RFC9331]) by
   making the multiplicative decrease depend on the prevailing value of
   alpha as follows:

      ssthresh = (1 - alpha/2) * cwnd;

   Linux implementation notes:

   Upscaled alpha:  With reference to the earlier discussion of integer
      arithmetic precision (Section 2.3.2), alpha = g * upscaled_alpha.

   Carry of fractional cwnd remainder:  Typically the absolute reduction
      in the window is only a small number of segments.  So, if the
      Prague CCA implementation counts the window in integer segments
      (as in the Linux reference code), delay can be made significantly
      less jumpy by tracking a fractional value, cwnd_carry, alongside
      the integer window and carrying over any fractional remainder to
      the next reduction.  Also, integer rounding bias ought to be
      removed from the multiplicative decrease calculation.
      Specifically:















De Schepper, et al.       Expires 16 April 2024                [Page 16]

Internet-Draft          Prague Congestion Control           October 2023


      #define ONE_CWND (1LL << 20)        /* Must be signed */
      #define MAX_ALPHA (1ULL << 20)

      /* On CE feedback, calculate the reduction in cwnd */
          /* Adding MAX_ALPHA to the numerator effectively adds 1/2
           *  which compensates for integer division rounding down */
          reduction = (alpha*cwnd*ONE_CWND + MAX_ALPHA) / MAX_ALPHA / 2;
          cwnd_carry -= reduction;

      /* Round reduction into whole segments and carry the remainder */
          if (cwnd_carry <= -ONE_CWND) {
              cwnd_carry += ONE_CWND;
              cwnd = max(cwnd - 1, MIN_CWND);
              ssthresh = cwnd;
          }

   In dynamic scenarios, as flows find a new operating point, alpha will
   have often tailed away to near-nothing before the onset of
   congestion.  Then DCTCP's tiny reduction followed by no further
   response for a round is precisely the wrong way for a CCA to respond.
   A solution to this problem is being evaluated as part of the work
   already mentioned to improve Prague's responsiveness (see
   Section 3.1.3 in the section on variants and future work).

2.4.3.  Additive Increase and ECN Feedback

   Unlike DCTCP, a Prague CCA does not suppress additive increase for
   one round trip after a congestion window reduction (i.e., while in
   CWR state).  Instead, a Prague CCA applies additive increase
   irrespective of its CWR state, but only for bytes that have been
   ACK'd without ECN feedback.  Specifically, on each ACK,

       cwnd += (acked_sacked - ece_delta) * ai_per_rtt / cwnd;

   where:

      acked_sacked is the number of new bytes acknowledged by the ACK;

      ece_delta is the number of newly acknowledge ECN-marked bytes;

      ai_per_rtt is a scaling factor that is typically 1 SMSS except for
      small RTTs (see Section 2.4.4)

   Superficially, the classical suppression of additive increase for the
   round after a decrease seems to make sense.  However, DCTCP and
   Prague are designed to induce an average of 2 congestion marks per
   RTT in steady state, which leaves very little space for any increase
   between the end of one round of CWR and the next mark.  In tests,



De Schepper, et al.       Expires 16 April 2024                [Page 17]

Internet-Draft          Prague Congestion Control           October 2023


   when a test version of a Prague CCA is configured to completely
   suppress additive increase during CWR (like Reno and DCTCP), its
   sawteeth become more irregular, which is its way of making some
   decreases large enough to open up enough space for an increase.  This
   irregularity tends to reduce link utilization.  Therefore, the Prague
   CCA has been modified to continue additive increase irrespective of
   CWR state.

   Nonetheless, rather than continue additive increase regardless of
   congestion, it is safer to only increase on those ACKs that do not
   feed back congestion.  This approach reduces additive increase as the
   marking probability increases, which tends to keep the marking level
   unsaturated (below 100%) (see Section 3.1 of [Tensions17]).  Under
   stable conditions, Prague's congestion window then becomes
   proportional to (1-p)/p, rather than 1/p.

   See also 'Faster than Additive Increase' (Section 3.1.2)

2.4.4.  Reduced RTT-Dependence

   The window-based AIMD described so far was inherited from Reno via
   DCTCP.  When many long-running Reno flows share a link, their
   relative packet rates become roughly inversely proportional to RTT.
   Then a flow with very small RTT will dominate any flows with larger
   RTTs.

   Queuing delay sets a lower limit to the smallest possible RTT.  So,
   prior to the extremely low queuing delay of L4S, extreme cases of RTT
   dependence had never been apparent.  Now that L4S has removed most of
   the queuing delay, we have to address the root-cause of RTT-
   dependence, which a Prague CCA is required to do, at least when the
   RTT is small (see the 'Reduced RTT bias' aspect of Req
   4.3. [RFC9331]).  Here, a small RTT is defined as below "typical RTTs
   expected in the intended deployment scenario".

   The Linux reference Prague CCA reduces RTT bias by using a virtual
   RTT (rtt_virt) rather than the actual smoothed RTT (srtt) for all
   three of: i) the period of additive window increase; ii) the EWMA
   update period; and iii) the duration of CWR state after a decrease.
   rtt_virt is calculated as a function of the actual smoothed RTT,
   chosen so that, when the srtt is high, the virtual RTT is essentially
   the same; but for lower actual RTTs, the virtual RTT is increasingly
   larger than the actual RTT.  Example functions that implementations
   might use for the virtual RTT are:

      rtt_virt = max(srtt, RTT_VIRT_MIN);

      rtt_virt = srtt + AdditionalRTT;



De Schepper, et al.       Expires 16 April 2024                [Page 18]

Internet-Draft          Prague Congestion Control           October 2023


   where the parameters RTT_VIRT_MIN or AdditionalRTT would be set for a
   particular deployment environment.

   The Linux implementation of Prague uses the first example and for the
   public Internet sets RTT_VIRT_MIN=25ms.  Thus, Linux Prague defines
   rtt_virt = max(srtt, 25ms), which addresses the Prague requirement
   for Reduced RTT-Dependence when the RTT is smaller than typical
   public Internet RTTs.

   As the actual window (cwnd) is still sent within 1 actual RTT, we
   also need to use a (conceptual) virtual window, cwnd_virt.  For
   instance, if rtt_virt = 25 ms then, when the actual RTT is 5 ms,
   there are rtt_virt/srtt = 5 times more packets in cwnd_virt, than in
   the actual window, cwnd, because cwnd_virt spans 5 actual round
   trips.  We define M as the ratio rtt_virt/srtt.

   In Reno or DCTCP, additive increase is implemented by dividing the
   desired increase of 1 segment per round over the cwnd packets in the
   round.  This requires an increase of 1/cwnd per packet.

   In the Linux implementation of TCP Prague, the aim is to increase the
   reference window by 1 segment over a virtual RTT.  However, in
   practice the increase is applied to the actual window, cwnd, which is
   M times smaller than cwnd_virt.  So cwnd has to be increased by only
   1/M segments over rtt_virt.  But again, in practice, the increase is
   applied over an actual window of packets spanning an actual RTT,
   which is also M times smaller than the virtual RTT.  So the desired
   increase in cwnd is only 1/M^2 segments over an actual round trip
   containing cwnd packets.  Therefore, the increase in cwnd per packet
   has to be (1/M^2) * (1/cwnd).

   Unless a flow lasts long enough for rates to converge, aiming for
   equal rates will not be relevant.  So, in the Linux implementation of
   Prague, the Reduced RTT-Dependence algorithm only comes into effect
   after D rounds, where D is configurable (current default 500).
   Continuing the previous example, if actual srtt=5 ms and rtt_virt =
   25 ms, Prague would use the regular RTT-dependent algorithm for the
   first 500*5ms = 2.5s.  Then it would start to converge to more equal
   rates using its Reduced RTT-Dependence algorithm.  If the actual RTT
   were higher (e.g. 20ms), it would stay in the regular RTT-dependent
   mode for longer (500 rounds = 10s), but this would be mitigated by
   the actual RTT it uses at the start being closer to the virtual RTT
   that it eventually uses (20ms and 25ms resp.).

   This approach prevents reduced RTT-dependence from making the flow
   less responsive at start-up and ensures that its early throughput
   share is based on its actual RTT.  The benefit is that short flows
   (mice) give themselves priority over longer flows (elephants), and



De Schepper, et al.       Expires 16 April 2024                [Page 19]

Internet-Draft          Prague Congestion Control           October 2023


   shorter RTTs will still converge faster than longer RTTs.
   Nonetheless, the throughput still converges to equal rates after D
   rounds.

   It is planned to reset the algorithm to the regular RTT-dependent
   behaviour after an idle, not just at flow start, as discussed under
   Future Work in Section 3.4.

   Section 3.4 also discusses extending the reduction in RTT-dependence
   to longer RTTs than RTT_VIRT_MIN (i.e. longer than 25ms).  The
   current Linux Prague implementation does not support this.

2.4.5.  Flow Start or Restart

   Currently the Linux reference implementation of TCP Prague uses the
   standard Linux slow start code.  Slow start is exited once a single
   mark is detected.

   When other flows are actively filling the link, regular marks are
   expected, causing slow start of new flows to end prematurely.  This
   is clearly not ideal, so other approaches are being worked on (see
   Section 3.1.1).  However, slow start has been left as the default
   until a properly matured solution is completed.

2.5.  Packet Sending


2.5.1.  Packet Pacing

   A Prague CCA MUST pace the packets it sends to avoid the queuing
   delay and under-utilization that would otherwise be caused by bursts
   of packets that can occur, for example, when restarting from idle or
   a jump in the acknowledgement number opens up cwnd.  Prague does this
   in a similar way to the base Linux TCP stack, by spacing out the
   window of packets evenly over the round trip time, using the
   following calculation of the pacing rate [b/s]:

      pacing_rate = MTU_BITS * max(cwnd, inflight) / srtt;

   During slow start, as in the base Linux TCP stack, Prague factors up
   pacing_rate by 2, so that it paces out packets twice as fast as they
   are acknowledged.  This keeps up with the doubling of cwnd, but still
   prevents bursts in response to any larger transient jumps in cwnd.

       if (cwnd < ssthresh / 2)
           pacing_rate *= 2;





De Schepper, et al.       Expires 16 April 2024                [Page 20]

Internet-Draft          Prague Congestion Control           October 2023


   During congestion avoidance, the Linux TCP Prague implementation does
   not factor up pacing_rate at all.  This contrasts with the base Linux
   TCP stack, which currently factors up pacing_rate by a ratio
   parameter set to 1.2.  The developers of the base Linux stack
   confirmed that this factor of 1.2 was only introduced in case it
   improved performance, but there were no scenarios where it was known
   to be needed.  In testing of Prague, this factor was found to cause
   queue delay spikes whenever cwnd jumped more than usual.  And
   throughput was no worse without it.  So it was removed from the Linux
   TCP Prague CCA.

   A Prague CCA can use alternatives to the traditional slow-start
   algorithm that use different pacing (see Section 2.4.5).

2.5.2.  Segmentation Offload

   In the absence of hardware pacing, it becomes increasingly difficult
   for a machine to scale to higher flow rates unless it is allowed to
   send packets in larger bursts, for instance using segmentation
   offload.  Happily, as flow rate scales up, proportionately more
   packets can be allowed in a burst for the same amount of queuing
   delay at the bottleneck.

   Therefore, a Prague CCA sends packets in a burst as long as it will
   not induce more than MAX_BURST_DELAY of queuing at the bottleneck.
   From this constant and the current pacing_rate, it calculates how
   many MTU-sized packets to allow in a burst:

      max_burst = pacing_rate * MAX_BURST_DELAY / MTU_BITS

   The current default in the Linux TCP Prague for MAX_BURST_DELAY is
   250us which supports marking thresholds starting from about 500us
   without underutilization.  This approach is similar to that in the
   Linux TCP stack, except there MAX_BURST_DELAY is 1ms.

3.  Variants and Future Work


3.1.  Getting up to Speed Faster

   Appendix A.2. of [RFC9331] outlines the performance optimizations
   needed when transplanting DCTCP from a DC environment to a wide area
   network.  The following subsections address two of those points:
   faster flow startup and faster than additive increase.  Then
   Section 3.1.3 covers the flip side, in which established flows have
   to yield faster to make room, otherwise queuing will result.





De Schepper, et al.       Expires 16 April 2024                [Page 21]

Internet-Draft          Prague Congestion Control           October 2023


3.1.1.  Flow Start (or Restart)

   For faster flow start, two approaches are currently being
   investigated in parallel:

   Modified Slow Start:  The traditional exponential slow start can be
      modified both at the start and the end, with the aim of reducing
      the risk of queuing due to bursts and overshoot:

      Pacing IW:  A Prague CCA can use an initial window of 10 (IW10
         [RFC6928]), but pacing of this Initial Window is recommended to
         try to avoid the pulse of queuing that could otherwise occur.
         Pacing IW10 also spreads the ACKs over the round trip so that
         subsequent rounds consist of ten subsets of packets (with 2, 4,
         8 etc.  per round in each subset), rather than a single set
         with 20, 40, 80 etc. in each round.  With IW paced, if a queue
         builds during a round (e.g. due to other unexpected traffic
         arriving) it can drain in the gap before the next subset,
         rather than the whole set backing up into a much larger queue.
         As smoothed RTT is unknown or inaccurate at the start of a
         flow, an implementation can pace IW over a fraction of the
         initial smoothed RTT (perhaps also clamped between hard-coded
         sanity limits).  The implementation could also initialize SRTT
         with a value it had previously cached per destination (as long
         as it is sufficiently fresh).  The safety factor could depend
         on whether a cached value was available and how recent it was.

         In the Linux reference implementation of TCP Prague, IW pacing
         can be optionally enabled, but it is off by default, because it
         is yet to be fully evaluated.  It currently paces IW over half
         the initial smoothed round trip time (SRTT) measured during the
         handshake.  SRTT is halved because the RTT often reduces after
         the initial handshake.  For example: i) some CDNs move the flow
         to a closer server after establishment; ii) the initial RTT
         from a server can include the time to wake a sleeping handset
         radio; iii) some uplink technologies take a link-level round
         trip to request a scheduling slot.

         It is also planned to exploit any cached knowledge of the path
         RTT to improve the initial estimate, for instance using the
         Linux per-destination cache.  It is also planned to allow the
         application to give an RTT hint (by setting sk_max_pacing_rate
         in Linux) if the developer has reason to believe that the
         application has a better estimate.

      Exiting slow start more gracefully:  In the wide area Internet (in





De Schepper, et al.       Expires 16 April 2024                [Page 22]

Internet-Draft          Prague Congestion Control           October 2023


         contrast to data centres), bottleneck access links tend to have
         much less capacity than the line rate of the sender.  With a
         shallow immediate ECN threshold at this bottleneck, the
         slightest burst can tend to induce an ECN mark, which
         traditionally causes slow start to exit.  A more gradual exit
         is being investigated for a Prague CCA using the extent of
         marking, not just the existence of a single mark.  This will be
         more consistent with the extent-based marking that scalable
         congestion controls use during congestion avoidance.  Delay
         measurements (similar to Hystart++ [RFC9406]) can also be used
         to complement the ECN signals.

   Paced Chirping:  In this approach, the aim is to both increase more
      rapidly than exponential slow-start and to greatly reduce any
      overshoot.  It is primarily a delay-based approach, but the aim is
      also to exploit ECN signals when present (while not forgetting
      loss either).  Therefore Paced Chirping is generally usable for
      any congestion control - not solely for a Prague CCA and L4S.

      Instead of only aiming to detect capacity overshoot at the end of
      flow-start, brief trains of rapidly decreasing inter-packet
      spacing called chirps are used to test many rates with as few
      packets and as little load as possible.  A full description is
      beyond the scope of this document.  [LinuxPacedChirping]
      introduces the concepts and the code as well as citing the main
      papers on Paced Chirping.

      Paced chirping works well over continuous links such as Ethernet
      and DSL.  But better averaging and noise filtering are necessary
      over discontinuous link technologies such as WiFi, LTE cellular
      radio, passive optical networks (PON) and data over cable
      (DOCSIS).  This is the current focus of this work.

      The current Linux implementation of TCP Prague does not include
      Paced Chirping, but research code is available separately in Linux
      and ns3. it is accessible via the L4S landing page [L4S-home].

3.1.2.  Faster than Additive Increase

   A Prague CCA has a startup phase and congestion avoidance phase like
   traditional CCAs.  In steady-state during congestion avoidance, like
   all scalable congestion controls, it induces frequent ECN marks, with
   the same average recovery time between ECN marks, no matter how much
   the flow rate scales.

   If available capacity suddenly increases, e.g. other flow(s) depart
   or the link capacity increases, these regular ECN marks will stop.
   Therefore after a few rounds of silence (no ECN marks) in congestion



De Schepper, et al.       Expires 16 April 2024                [Page 23]

Internet-Draft          Prague Congestion Control           October 2023


   avoidance phase, a Prague CCA can assume that available capacity has
   increased, and switch to using the techniques from its startup phase
   (Section 3.1.1) to rapidly find the new, faster operating point.
   Then it can shift back into its congestion avoidance behaviour.

   That is the theory.  But, as explained in Section 3.1.1, the startup
   techniques, specifically paced chirping, are still being developed
   for discontinuous link types.  Once the startup behaviour is
   available, the Linux implementation of a Prague CCA will also have a
   faster than additive increase behaviour.  S.3.2.3 of [PragueLinux])
   gives a brief preview of the performance of this approach over an
   Ethernet link type in ns3.

3.1.3.  Remove Lag in Congestion Response

   To keep queuing delay low, new flows can only push in fast if
   established flows yield fast.  It has recently been realized that the
   design of a Prague EWMA and congestion response introduces 1-2 rounds
   of lag (on top of the inherent round of feedback delay due to the
   speed of light).  These lags were inherited from the design of DCTCP
   (see Section 2.3.2 and Section 2.4.2), where a couple of extra
   hundred microseconds was less noticeable.  But congestion control in
   the wide area Internet cannot afford up to 2 rounds trips of extra
   lag.

   To be clear, lag means delay before any response at all starts.  That
   is qualititatively different from the smoothing gain of an EWMA,
   which /reduces/ the response by the gain factor (1/16 by default) in
   case a change in congestion does not persist.  Smoothing gain can
   always be increased.  But 1-2 rounds of lag means that, when a new
   flow tries to push in, the sender of an established flow will not
   respond /at all/ for 1-2 rounds after it first receives congestion
   feedback.

   A Prague CCA spends the first round trip of this lag gathering
   feedback to measure frac before it is input into the EWMA algorithm
   (see Section 2.3.2).  Then there is up to one further round of delay
   because the implementations of DCTCP and Prague did not fully adopt
   the paradigm shift to extent-based marking - the timing of the
   decrease is still based on Reno.

   Both Reno and DCTCP/Prague respond immediately on the first sign of
   congestion.  Reno's response is large, so it waits a round in CWR
   state to allow the response to take effect.  DCTCP's response is tiny
   (extent-based), but then it still waits a round in CWR state.  So it
   does next-to-nothing for a round.





De Schepper, et al.       Expires 16 April 2024                [Page 24]

Internet-Draft          Prague Congestion Control           October 2023


   New EWMA and response algorithms to remove these 1-2 extra rounds of
   lag are described in [PerAckEWMA].  They have been implemented in
   Linux and an iterative process of evaluation and redesign is in
   progress.  The EWMA is updated per-ACK, but it still changes as if it
   is clocked per round trip.  The congestion response is still
   triggered by the first indication of ECN feedback, but it proceeds
   over the subsequent round trip so that it can take into account
   further incoming feedback as the EWMA evolves.  The reduction is
   applied per-ACK but sized to result as if it had been a single
   response per round trip.

3.2.  Combining Congestion Metrics

   Ultimately, it would be preferable to take an integrated approach and
   use a combination of ECN, loss and delay metrics to drive congestion
   control.  For instance, using a downward trend in ECN marking and/or
   delay as a heuristic to temper the response to loss.  Such ideas are
   not in the immediate plans for the Linux TCP Prague, but some more
   specific ideas are highlighted in the following subsections.

3.2.1.  ECN with Loss

   If the bottleneck is ECN-capable, a loss due to congestion is very
   likely to have been preceded by a period of ECN marking.  When the
   current Linux TCP Prague CCA detects a loss, like DCTCP, it halves
   cwnd, even if it has already reduced cwnd in the same round trip due
   to ECN marking.  This double reduction can end up factoring down cwnd
   to as little as 1/4 in one round trip.  This is not necessarily
   detrimental (experimentation will tell), but, if necessary, the
   response to loss can be factored down, so that the combination of
   both responses is the same as the reduction that would have occurred
   due to loss.

   Specifically, on a loss while in CWR state following an ECN
   reduction, for an implementation that uses Reno response, it would be
   possible to use a decrease factor of 1/(2-alpha), which would
   compound with the previous decrease factor of (1-alpha/2) to result
   in a factor of: (1 - alpha/2) / (2-alpha)) = 1/2.  In integer
   arithmetic, this division would be possible but relatively expensive.
   A less expensive alternative would be a decrease factor of
   (2+alpha)/4, which approximates to a compounded decrease factor of
   1/2 for typical low values of alpha, even up to 30%. The compound
   decrease factor is never greater than 1/2 and in the worst case, if
   alpha were 100%, it is 3/8.

   If an implementation uses a CUBIC response on loss after an ECN
   reduction in the same round trip, in response to both ECN and loss it
   would not be appropriate to always aim for a combined reduction to



De Schepper, et al.       Expires 16 April 2024                [Page 25]

Internet-Draft          Prague Congestion Control           October 2023


   70%. This is because Prague's response to ECN alone can reduce cwnd
   to as little as 50%, so aiming for 70% would perversely require cwnd
   to _increase_ on a loss in the same round as ECN marking.
   Experimentation is needed, but an initial proposal would be a
   multiplicative decrease factor of (2+alpha)/3.  This would never
   result in an increase on loss.  It would result in a combined
   reduction factor of about 2/3 (i.e. almost 70%) if alpha was low,
   rising to a combined reduction factor of 1/2 as alpha tends towards
   100%.

3.2.2.  ECN with Delay

   Section 3.1.2 described the plans to shift between using ECN when
   close to the operating point and using delay by injecting paced
   chirps to find a new operating after the ECN signal goes silent for a
   few rounds.  Paced chirping shifts more slowly to the new operating
   point the more noise there is in the delay measurements.  Work is
   ongoing on treating any ECN marking as a complementary metric.  The
   resulting less noisy combined metric should then allow the controller
   to shift more rapidly to each new operating point.

   An alternative would be to combine ECN with the BBR approach, which
   induces a much less noisy delay signal by using less frequent but
   more pronounced delay spikes.  The approach currently being taken is
   to adapt the chirp length to the degree of noise, so the chirps only
   become longer and/or more pronounced when necessary, for instance
   when faced with a discontinuous link technology such as WiFi.  With
   multiple chirps per round, the noise can still be filtered out by
   averaging over them all, rather than trying to remove noise from each
   spike.  This keeps the 'self-harm' to the minimum necessary, and
   ensures that capacity is always being sampled, which removes the risk
   of going stale.

3.3.  Fall-Back on Classic ECN

   The implementation of the TCP Prague CCA in Linux includes an
   algorithm to detect a Classic ECN AQM and fall back to Reno as a
   result, as required by the 'Coexistence with Classic ECN' aspect of
   the Prague Req 4.3. [RFC9331].

   The algorithm currently used (v2) is relatively simple, but rather
   than describe it here, full rationale, pseudocode and explanation can
   be found in the technical report about it [ecn-fallback].  This also
   includes a selection of the evaluation results and a link to
   visualizations of the full results online.  The current algorithm
   nearly always detects a Classic ECN AQM, and in the majority of the
   wide range of scenarios tested it is good at detecting an L4S AQM.
   However, it wrongly identifies an L4S AQM as Classic in a significant



De Schepper, et al.       Expires 16 April 2024                [Page 26]

Internet-Draft          Prague Congestion Control           October 2023


   minority of cases when the link rate is low, or the RTT is high.  The
   report gives ideas on how to improve detection in these scenarios,
   but in the mean time the algorithm has been disabled by default.

   Recently, the report cited above has been updated to include new
   ideas on other ways to distinguish Classic from L4S AQMs.  The
   interested reader can access it themselves, so this living document
   will not be further summarized here.

3.4.  Further Reduced RTT-Dependence

   The algorithm to reduce RTT dependence is only relevant for long-
   running flows.  So in the current TCP Prague implementation it
   remains disabled for a certain number of round trips after the start
   of a flow, as explained in Section 2.4.4.  Instead, it would be
   possible to make rtt_virt gradually move from the actual RTT to the
   target virtual RTT, or perhaps depend on other parameters of the
   flow.  Nonetheless, just switching in the algorithm after a number of
   rounds works well enough.  It is planned to also disable the
   algorithm for a similar duration if a flow becomes idle then
   restarts, but this is yet to be evaluated.

   Prague Req 4.3. in [RFC9331]) only requires reduced RTT bias "in the
   range between the minimum likely RTT and typical RTTs expected in the
   intended deployment scenario".  Nonetheless, in future it would be
   preferable to be able to reduce the RTT bias for high RTT flows as
   well.

   If a step AQM is used, the congestion episodes of flows with
   different RTTs tend to synchronize, which exacerbates RTT bias.  To
   prevent this two candidate approaches will need to be investigated:
   i) It might be sufficient to deprecate step AQMs for L4S (they are
   not the preferred recommendation in [RFC9332]); or ii) the virtual
   RTT approach of Section 2.4.4 might be usable for higher than typical
   RTTs as well as lower.  In this latter case, (srtt/rtt_virt)^2
   segments would need to be added to the window per actual RTT.  The
   current TCP Prague implementation does not support this faster AI for
   RTTs longer than RTT_VIRT_MIN (25ms), due to the expected (but
   unverified) impact on latency overshoot and responsiveness.

3.5.  Scaling Down to Fractional Windows

   A modification to v5.0 of the Linux TCP stack that scales down to
   sub-packet windows is available for research purposes via the L4S
   landing page [L4S-home].  The L4S Prague Requirements in section 4.3
   of [RFC9331] recommend but no longer mandate scaling down to sub-
   packet windows.  This is because becoming unresponsive at a minimum
   window is a tradeoff between protecting against other unresponsive



De Schepper, et al.       Expires 16 April 2024                [Page 27]

Internet-Draft          Prague Congestion Control           October 2023


   flows and the extra queue you induce by becoming unresponsive
   yourself.  So this code is not maintained as part of the Linux
   implementation of TCP Prague.

   Firstly, the stack has to be modifed to maintain a fractional
   congestion window.  Then, because the ACK clock cannot work below 1
   packet per RTT, the code sets the time to send each packet, then
   readjusts the timing as each ACK arrives (otherwise any queuing
   accumulates a burst in subsequent rounds).  Also, additive increase
   of one segment does not scale below a 1-segment window.  So instead
   of a constant additive increase, the code uses a logarithmically
   scaled additive increase that slowly adapts the additive increase
   constant to the slow start threshold.  Despite these quite radical
   changes, the diff is surprisingly small.  The design and
   implementation is explained in [Ahmed19], which also includes
   evaluation results.

4.  IANA Considerations

   This specification contains no IANA considerations.

5.  Security Considerations

   Section 3.5 on scaling down to fractional windows discusses the
   tradeoff in becoming unresponsive at a minium window, which causes a
   queue to build (harm to self and to others) but protects oneself
   against other unresponsive flows (whether malicious or accidental).

   This draft inherits the security considerations discussed in
   [RFC9331] and in the L4S architecture [RFC9330].  In particular, the
   self-interest incentive to be responsive and minimize queuing delay,
   and protections against those interested in disrupting the low
   queuing delay of others.

6.  References

6.1.  Normative References

   [I-D.ietf-tcpm-accurate-ecn]
              Briscoe, B., KÃ¼hlewind, M., and R. Scheffenegger, "More
              Accurate Explicit Congestion Notification (ECN) Feedback
              in TCP", Work in Progress, Internet-Draft, draft-ietf-
              tcpm-accurate-ecn-26, 24 July 2023,
              <https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-
              accurate-ecn-26>.






De Schepper, et al.       Expires 16 April 2024                [Page 28]

Internet-Draft          Prague Congestion Control           October 2023


   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/info/rfc3168>.

   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
              Notification (ECN) Experimentation", RFC 8311,
              DOI 10.17487/RFC8311, January 2018,
              <https://www.rfc-editor.org/info/rfc8311>.

   [RFC9331]  De Schepper, K. and B. Briscoe, Ed., "The Explicit
              Congestion Notification (ECN) Protocol for Low Latency,
              Low Loss, and Scalable Throughput (L4S)", RFC 9331,
              DOI 10.17487/RFC9331, January 2023,
              <https://www.rfc-editor.org/info/rfc9331>.

6.2.  Informative References

   [Ahmed19]  Ahmed, A.S., "Extending TCP for Low Round Trip Delay",
              Masters Thesis, Uni Oslo , August 2019,
              <https://www.duo.uio.no/handle/10852/70966>.

   [DCTCP_flaws]
              Misund, J. and B. Briscoe, "Disentangling Flaws in Linux
              DCTCP", arXiv Technical Report 211.07581 [cs.NI], November
              2022, <https://arxiv.org/abs/2211.07581>.

   [ecn-fallback]
              Briscoe, B. and A.S. Ahmed, "TCP Prague Fall-back on
              Detection of a Classic ECN AQM", bobbriscoe.net Technical
              Report TR-BB-2019-002, April 2020,
              <https://arxiv.org/abs/1911.00710>.

   [I-D.ietf-tcpm-generalized-ecn]
              Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit
              Congestion Notification (ECN) to TCP Control Packets",
              Work in Progress, Internet-Draft, draft-ietf-tcpm-
              generalized-ecn-12, 26 July 2023,
              <https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-
              generalized-ecn-12>.

   [L4S-home] "L4S: Ultra-Low Queuing Delay for All",
              <https://riteproject.eu/dctth/#code>.



De Schepper, et al.       Expires 16 April 2024                [Page 29]

Internet-Draft          Prague Congestion Control           October 2023


   [linux-code]
              "Linux kernel tree with L4S patches",
              <https://github.com/L4STeam/linux>.

   [LinuxPacedChirping]
              Misund, J. and B. Briscoe, "Paced Chirping - Rethinking
              TCP start-up", Proc. Linux Netdev 0x13 , March 2019,
              <https://www.netdevconf.org/0x13/session.html?talk-chirp>.

   [patch-alpha-zero]
              Shewmaker, A. G., "tcp: allow dctcp alpha to drop to
              zero", Linux GitHub patch; Commit: c80dbe0, 23 October
              2015,
              <https://github.com/torvalds/linux/commits/master/net/
              ipv4/tcp_dctcp.c>.

   [patch-loss-react]
              De Schepper, K., "tcp: Ensure DCTCP reacts to losses",
              Linux GitHub patch; Commit: aecfde2, 4 April 2019,
              <https://github.com/torvalds/linux/commits/master/net/
              ipv4/tcp_dctcp.c>.

   [PerAckEWMA]
              Briscoe, B., "Improving DCTCP/Prague Congestion Control
              Responsiveness", Technical Report TR-BB-2020-002, 20
              January 2021, <https://arxiv.org/abs/2101.07727>.

   [PragueLinux]
              Briscoe, B., De Schepper, K., Albisser, O., Misund, J.,
              Tilmans, O., KÃ¼hlewind, M., and A.S. Ahmed, "Implementing
              the `TCP Prague' Requirements for Low Latency Low Loss
              Scalable Throughput (L4S)", Proc. Linux Netdev 0x13 ,
              March 2019, <https://www.netdevconf.org/0x13/
              session.html?talk-tcp-prague-l4s>.

   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
              RFC 3649, DOI 10.17487/RFC3649, December 2003,
              <https://www.rfc-editor.org/info/rfc3649>.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340,
              DOI 10.17487/RFC4340, March 2006,
              <https://www.rfc-editor.org/info/rfc4340>.

   [RFC5033]  Floyd, S. and M. Allman, "Specifying New Congestion
              Control Algorithms", BCP 133, RFC 5033,
              DOI 10.17487/RFC5033, August 2007,
              <https://www.rfc-editor.org/info/rfc5033>.



De Schepper, et al.       Expires 16 April 2024                [Page 30]

Internet-Draft          Prague Congestion Control           October 2023


   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
              <https://www.rfc-editor.org/info/rfc5681>.

   [RFC6679]  Westerlund, M., Johansson, I., Perkins, C., O'Hanlon, P.,
              and K. Carlberg, "Explicit Congestion Notification (ECN)
              for RTP over UDP", RFC 6679, DOI 10.17487/RFC6679, August
              2012, <https://www.rfc-editor.org/info/rfc6679>.

   [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
              "Increasing TCP's Initial Window", RFC 6928,
              DOI 10.17487/RFC6928, April 2013,
              <https://www.rfc-editor.org/info/rfc6928>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

   [RFC8298]  Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation
              for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December
              2017, <https://www.rfc-editor.org/info/rfc8298>.

   [RFC8888]  Sarker, Z., Perkins, C., Singh, V., and M. Ramalho, "RTP
              Control Protocol (RTCP) Feedback for Congestion Control",
              RFC 8888, DOI 10.17487/RFC8888, January 2021,
              <https://www.rfc-editor.org/info/rfc8888>.

   [RFC8985]  Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "The
              RACK-TLP Loss Detection Algorithm for TCP", RFC 8985,
              DOI 10.17487/RFC8985, February 2021,
              <https://www.rfc-editor.org/info/rfc8985>.

   [RFC9000]  Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based
              Multiplexed and Secure Transport", RFC 9000,
              DOI 10.17487/RFC9000, May 2021,
              <https://www.rfc-editor.org/info/rfc9000>.

   [RFC9260]  Stewart, R., TÃ¼xen, M., and K. Nielsen, "Stream Control
              Transmission Protocol", RFC 9260, DOI 10.17487/RFC9260,
              June 2022, <https://www.rfc-editor.org/info/rfc9260>.






De Schepper, et al.       Expires 16 April 2024                [Page 31]

Internet-Draft          Prague Congestion Control           October 2023


   [RFC9330]  Briscoe, B., Ed., De Schepper, K., Bagnulo, M., and G.
              White, "Low Latency, Low Loss, and Scalable Throughput
              (L4S) Internet Service: Architecture", RFC 9330,
              DOI 10.17487/RFC9330, January 2023,
              <https://www.rfc-editor.org/info/rfc9330>.

   [RFC9332]  De Schepper, K., Briscoe, B., Ed., and G. White, "Dual-
              Queue Coupled Active Queue Management (AQM) for Low
              Latency, Low Loss, and Scalable Throughput (L4S)",
              RFC 9332, DOI 10.17487/RFC9332, January 2023,
              <https://www.rfc-editor.org/info/rfc9332>.

   [RFC9406]  Balasubramanian, P., Huang, Y., and M. Olson, "HyStart++:
              Modified Slow Start for TCP", RFC 9406,
              DOI 10.17487/RFC9406, May 2023,
              <https://www.rfc-editor.org/info/rfc9406>.

   [RFC9438]  Xu, L., Ha, S., Rhee, I., Goel, V., and L. Eggert, Ed.,
              "CUBIC for Fast and Long-Distance Networks", RFC 9438,
              DOI 10.17487/RFC9438, August 2023,
              <https://www.rfc-editor.org/info/rfc9438>.

   [Tensions17]
              Briscoe, B. and K. De Schepper, "Resolving Tensions
              between Congestion Control Scaling Requirements", Simula
              Technical Report TR-CS-2016-001; arXiv:1904.07605, July
              2017, <https://arxiv.org/abs/1904.07605>.

Comments and Contributions Solicited

   This section is to be removed before publishing as an RFC.

   Comments and questions are encouraged and very welcome.  They can be
   addressed to the IRTF Internet Congestion Control Research Group's
   mailing list <iccrg@irtf.org>, and/or to the authors via <draft-
   briscoe-iccrg-congestion-control@ietf.org>.

   Contributions of design ideas and/or code are also encouraged and
   welcome.  During the drafting process, the intention is to gather
   experience into this document from a wider set of Prague congestion
   control implementations.

Acknowledgements

   Thanks to Neal Cardwell for review comments.

   The following contributed implementations and evaluations that
   validated and helped to improve this specification:



De Schepper, et al.       Expires 16 April 2024                [Page 32]

Internet-Draft          Prague Congestion Control           October 2023


      Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> of Nokia
      Bell Labs, Belgium, prepared and maintains the Linux
      implementation of TCP Prague.

      Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> of Nokia
      Bell Labs, Belgium, contributed to the Linux implementation of TCP
      Prague.

      Joakim Misund <joakim.misund@gmail.com> of Uni Oslo, Norway, wrote
      the Linux paced chirping code.

      Asad Sajjad Ahmed <me@asadsa.com>, Independent, Norway, wrote the
      Linux code that maintains a sub-packet window.

      Vidhi Goel <vidhi_goel@apple.com> of Apple Inc, Cupertino, wrote
      and maintains the Apple implmentation of QUIC Prague.

   Bob Briscoe's contribution was part-funded by the Comcast Innovation
   Fund and part-funded by Apple Inc. The views expressed here are
   solely those of the authors.

Authors' Addresses

   Koen De Schepper
   Nokia Bell Labs
   Antwerp
   Belgium
   Email: koen.de_schepper@nokia.com
   URI:   https://www.bell-labs.com/usr/koen.de_schepper


   Olivier Tilmans
   Nokia Bell Labs
   Antwerp
   Belgium
   Email: olivier.tilmans@nokia-bell-labs.com


   Bob Briscoe (editor)
   Independent
   United Kingdom
   Email: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/








De Schepper, et al.       Expires 16 April 2024                [Page 33]

Internet-Draft          Prague Congestion Control           October 2023


   Vidhi Goel
   Apple Inc
   Cupertino,
   United States
   Email: vidhi_goel@apple.com














































De Schepper, et al.       Expires 16 April 2024                [Page 34]