TSVWG                                                          Y. Zhuang
Internet-Draft                                                    W. Sun
Intended status: Informational                                    L. Yan
Expires: May 6, 2020                       Huawei Technologies Co., Ltd.
                                                        November 3, 2019


  An Open Congestion Control Architecture for high performance fabrics
               draft-zhuang-tsvwg-open-cc-architecture-00

Abstract

   This document describes an open congestion control architecture for
   high performance fabrics that allows cloud operators and algorithm
   developers to deploy or develop new congestion control algorithms,
   and to configure them appropriately for their traffic, on smart
   NICs in a more efficient and flexible way.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 6, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of




Zhuang, et al.             Expires May 6, 2020                  [Page 1]

Internet-Draft           open congestion control           November 2019


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Conventions . . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Abbreviations . . . . . . . . . . . . . . . . . . . . . . . .   3
   4.  Observations in storage network . . . . . . . . . . . . . . .   4
   5.  Requirements of the open congestion control architecture  . .   5
   6.  Open Congestion Control (OpenCC) Architecture Overview  . . .   5
     6.1.  Congestion Control Platform and its user interfaces . . .   6
     6.2.  Congestion Control Engine (CCE) and its interfaces  . . .   7
   7.  Interoperability Consideration  . . . . . . . . . . . . . . .   7
     7.1.  Negotiate the congestion control algorithm  . . . . . . .   7
     7.2.  Negotiate the congestion control parameters . . . . . . .   8
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .   8
   9.  Manageability Consideration . . . . . . . . . . . . . . . . .   8
   10. IANA Considerations . . . . . . . . . . . . . . . . . . . . .   8
   11. References  . . . . . . . . . . . . . . . . . . . . . . . . .   8
     11.1.  Normative References . . . . . . . . . . . . . . . . . .   8
     11.2.  Informative References . . . . . . . . . . . . . . . . .   8
   Appendix A.  Experiments  . . . . . . . . . . . . . . . . . . . .   9
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  12

1.  Introduction

   Datacenter networks (DCNs) nowadays not only carry traffic for
   tenants using the TCP/IP protocol stack, but are also required to
   carry RDMA traffic for High Performance Computing (HPC) and
   distributed storage access applications, which require low latency
   and high throughput.

   Thus, for today's datacenter applications, the latency and
   throughput requirements are more demanding than for ordinary
   Internet traffic, and network congestion and queuing caused by
   incast are the main factors that increase traffic latency and
   reduce network throughput.  Accordingly, congestion control
   algorithms aimed at low latency and high bandwidth have been
   proposed, such as DCTCP [RFC8257] and [BBR] for TCP, and [DCQCN]
   for [RoCEv2].

   Besides, CPU utilization is another factor in the efficiency of
   traffic transmission for low latency applications.  By offloading
   some protocol processing onto smart NICs and bypassing the CPU,
   applications can write directly to hardware, which reduces
   transmission latency.  RDMA over RoCEv2 is currently a good example
   of the benefit of bypassing the kernel/CPU, while TCP offloading is
   also under discussion in [NVMe-oF].



Zhuang, et al.             Expires May 6, 2020                  [Page 2]

Internet-Draft           open congestion control           November 2019


   In general, on the one hand, cloud operators and application
   developers are working on new congestion control algorithms to meet
   the requirements of applications such as HPC, AI and storage in
   high performance fabrics; on the other hand, smart NIC vendors are
   offloading data plane and control plane functions onto hardware to
   reduce processing latency and improve performance.  This raises the
   question of how smart NICs can be optimized by offloading some
   functions onto hardware while still giving customers the
   flexibility to develop or change their congestion control
   algorithms and to run their experiments more easily.

   That said, it would be useful to have an open, modular design for
   congestion control on smart NICs so that new algorithms can be
   developed and deployed while taking advantage of hardware
   offloading in a generic way.

   This document describes an open congestion control architecture for
   high performance fabrics on smart NICs that allows cloud operators
   and application developers to install or develop new congestion
   control algorithms, and to select appropriate controls, in a more
   efficient and flexible way.

   It focuses only on the basic functionality and discusses some
   common interfaces towards network environments, administrators and
   application developers; detailed implementations are vendor-
   specific designs and are out of scope.

   Discussions of new congestion control algorithms and improved active
   queue management (AQM) are also out of scope for this document.

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Abbreviations

      IB - InfiniBand

      HPC - High Performance Computing

      ECN - Explicit Congestion Notification

      AI/HPC - Artificial Intelligence/High Performance Computing



Zhuang, et al.             Expires May 6, 2020                  [Page 3]

Internet-Draft           open congestion control           November 2019


      RDMA - Remote Direct Memory Access

      NIC - Network Interface Card

      AQM - Active Queue Management

4.  Observations in storage network

   Besides easing the development of new congestion control algorithms
   by developers while taking advantage of hardware offloading
   improvements by NIC vendors, we observe that there are also
   benefits in choosing the proper algorithm for a specific traffic
   pattern.

   As stated, there are several congestion control algorithms for low
   latency, high throughput datacenter applications, and the industry
   is still working on enhanced algorithms to meet the requirements of
   new applications in the high performance area.  A natural question
   is how to select a proper congestion control algorithm for a
   network, or whether a selected algorithm is efficient and
   sufficient for all traffic in that network.

   With this question in mind, we use a simplified storage network as
   a case study.  This typical network mainly carries two traffic
   types: query and backup.  Query traffic is latency sensitive, while
   backup traffic is throughput oriented.  We select several well-
   known TCP congestion control algorithms (Reno [RFC5681], CUBIC
   [RFC8312], DCTCP [RFC8257], and BBR [BBR]) for this study.

   Two sets of experiments were run to compare the performance of
   these algorithms for the different traffic types (i.e. traffic
   patterns).  The first set studies the performance when a single
   algorithm is used for both traffic types; the second set runs the
   two traffic types with combinations of congestion control
   algorithms.  The detailed experiments and test results can be found
   in Appendix A.

   According to the results of the first experiment set, BBR performs
   better than the others when applied to both traffic types; in the
   second experiment set, some algorithm combinations show better
   performance than using the same algorithm for both, even compared
   with BBR.

   As such, we believe there are benefits in letting different traffic
   patterns in the same network use their own algorithms to achieve
   better performance.  From a cloud operation perspective, this is
   another reason to have an open congestion control framework on the
   NIC that can select proper algorithms for different traffic
   patterns.
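
   For TCP traffic, per-connection algorithm selection already exists
   on some hosts.  As an illustration only (a minimal sketch assuming
   a Linux sender with the chosen algorithm modules loaded), the
   following code pins a query socket to DCTCP and a backup socket to
   BBR:

   #include <netinet/in.h>
   #include <netinet/tcp.h>
   #include <string.h>
   #include <sys/socket.h>

   /* Illustrative sketch: select a congestion control algorithm per
    * socket via the Linux TCP_CONGESTION socket option.  Returns 0
    * on success, -1 on failure (e.g. the module is not loaded). */
   static int set_cc(int fd, const char *algo)
   {
       return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                         algo, strlen(algo));
   }

   /* Usage (error handling omitted):
    *   set_cc(query_fd,  "dctcp");    latency sensitive queries
    *   set_cc(backup_fd, "bbr");      throughput-bound backups   */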






Zhuang, et al.             Expires May 6, 2020                  [Page 4]

Internet-Draft           open congestion control           November 2019


5.  Requirements of the open congestion control architecture

   According to these observations, the architecture design is
   suggested to follow several principles:

   o  Support developers in writing their own congestion control
      algorithms for NICs while keeping the benefit of congestion
      control offloading provided by NIC vendors.

   o  Support vendors in optimizing NIC performance through hardware
      offloading while allowing users to deploy and select new
      congestion control algorithms.

   o  Support configuration of congestion controls by administrators
      according to traffic patterns.

   o  Support settings from applications that express their QoS
      requirements.

   o  Be transport protocol independent, e.g. support both TCP and
      RoCE.

6.  Open Congestion Control (OpenCC) Architecture Overview

   The architecture shown in Figure 1 covers only the congestion
   control related components; components for other functions are
   omitted.  The OpenCC architecture consists of three layers.

   The bottom layer, the congestion control engine, provides common
   function blocks that are independent of the transport protocol and
   can be implemented in hardware.  The middle layer is the congestion
   control platform, in which the different congestion control
   algorithms are deployed; these algorithms can be installed by NIC
   vendors or developed by algorithm developers.  Finally, the top
   layer provides the interfaces (i.e.  APIs) to users: administrators
   who select proper algorithms and set proper parameters for their
   networks, applications that indicate QoS requirements which are
   mapped to runtime settings of congestion control parameters, and
   algorithm developers who write their own algorithms.











Zhuang, et al.             Expires May 6, 2020                  [Page 5]

Internet-Draft           open congestion control           November 2019


              +------------+  +-----------------+   +---------------+
  User        | Parameters |  | Application(run |   | CC developers |
  interfaces  |            |  | time settings)  |   |               |
              +-----+------+  +-------+---------+   +------+--------+
                    |                 |                    |
                    |                 |                    |
                    |                 |                    |
              +-----------------------+---------+          |
              |  Congestion control Algorithms  |          |
              |        +-----------------+      <----------+
  CC platform |       +-----------------+|      |
              |      +-----------------+|+      |
              |      |  CC algorithm#1 |+       |
              |      +-----------------+        |
              +--+--------+---------+---------+-+
                 |        |         |         |
                 |        |         |         |
              +--+--+ +---+---+ +---+----+ +--+---+
              |     | |       | |        | |      |   /  NIC signals
  CC Engine   |Token| |Packet | |Schedule| |CC    |  /--------------
              |mgr  | |Process| |        | |signal|  \--------------
              +-----+ +-------+ +--------+ +------+   \  Network signals


    Figure 1. The architecture of open congestion control

6.1.  Congestion Control Platform and its user interfaces

   The congestion control platform is a software environment in which
   various congestion control algorithms are deployed and configured.
   It provides three types of interfaces to the user layer for
   different purposes.

   The first is for administrators, who select proper congestion
   control algorithms for their network traffic and configure the
   corresponding parameters of the selected algorithms.

   The second can be an interface defined by NIC vendors or developers
   that provides APIs for application developers to express their QoS
   requirements, which are then mapped to runtime configuration of the
   controls.

   The last is for algorithm developers to write their own algorithms
   for the system.  It is suggested to define a common language for
   writing algorithms, which can then be compiled by vendor specific
   environments (in which toolkits or libraries can be provided) to
   generate the platform dependent code.
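
   As an illustration of what such a developer interface could look
   like, the following C sketch shows a hypothetical callback
   structure that an algorithm registers with the platform.  The names
   and fields are assumptions of this document, not a defined API.

   #include <stdint.h>

   /* Hypothetical per-flow congestion state kept by the platform. */
   struct opencc_flow {
       uint64_t rate_bps;    /* sending rate set by the algorithm  */
       uint32_t cwnd_bytes;  /* or a window, for window-based ones */
       void    *priv;        /* algorithm private state            */
   };

   /* Hypothetical signal record delivered by the CC engine; see
    * Section 6.2 for the kinds of signals it may carry. */
   struct opencc_signal;

   /* Callbacks an algorithm registers with the CC platform. */
   struct opencc_algo_ops {
       const char *name;
       int  (*init)(struct opencc_flow *f);
       void (*on_signal)(struct opencc_flow *f,
                         const struct opencc_signal *sig);
       void (*on_timeout)(struct opencc_flow *f);
       void (*release)(struct opencc_flow *f);
   };

   /* Provided by the platform; after registration the algorithm can
    * be selected by administrators or applications. */
   int opencc_register_algo(const struct opencc_algo_ops *ops);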





Zhuang, et al.             Expires May 6, 2020                  [Page 6]

Internet-Draft           open congestion control           November 2019


6.2.  Congestion Control Engine (CCE) and its interfaces

   Components in the congestion control engine can be offloaded to
   hardware to improve performance.  As such, the engine is suggested
   to provide common and basic functions, while the upper platform
   provides the extensibility and flexibility for additional
   functions.

   The CCE includes basic modules for packet transmission and the
   corresponding control.  Several function blocks are illustrated
   here, while the detailed implementation is out of scope for this
   document and left to NIC vendors.  The token manager distributes
   tokens to traffic, while the schedule block schedules the
   transmission time of that traffic.  The packet process block edits
   or processes packets before transmission.  The congestion control
   signal block collects or monitors signals from both the network and
   other NICs, which are fed to the congestion control algorithms.

   As such, an interface should be defined in the congestion control
   engine to receive congestion control signals from both other NICs
   and the network, for existing congestion control algorithms as well
   as new extensions.  This information is used as input to the
   control algorithms to adjust the sending rate, operate loss
   recovery, and so on.
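
   As an illustration of such a signal interface, the following sketch
   lists the kinds of signals the CCE might deliver to the algorithms.
   The type and field names are assumptions of this document and do
   not define any wire format.

   #include <stdint.h>

   /* Hypothetical signal record passed from the CC signal block in
    * the engine to the algorithm callbacks in the CC platform. */
   enum opencc_signal_type {
       OPENCC_SIG_ACK,        /* acknowledgement / completion      */
       OPENCC_SIG_ECN,        /* ECN-marked packet observed        */
       OPENCC_SIG_LOSS,       /* loss or retransmission detected   */
       OPENCC_SIG_NIC_NOTIFY  /* notification from a peer NIC      */
   };

   struct opencc_signal {
       enum opencc_signal_type type;
       uint64_t timestamp_ns; /* when the signal was observed       */
       uint32_t acked_bytes;  /* valid for ACK signals              */
       uint32_t ecn_marked;   /* marked packets since last report   */
       uint32_t rtt_us;       /* measured round trip time, if known */
   };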

7.  Interoperability Consideration

7.1.  Negotiate the congestion control algorithm

   Since there may be several congestion control algorithms available,
   hosts might negotiate their supported congestion control
   capabilities during the session setup phase.  However, the existing
   congestion control behavior should be used as the default to
   preserve compatibility with legacy devices.

   Also, the network devices on the path should be able to indicate
   whether they support any specific signals that a congestion control
   algorithm needs.  The capability negotiation between NICs and
   switches can be either an in-band, ECN-like negotiation or an out-
   of-band negotiation with individual messages.

   Alternatively, the system can use a centralized administration
   platform to configure the algorithms on the NICs and network
   devices.
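
   As an illustration of the out-of-band option, the following sketch
   shows a hypothetical capability advertisement exchanged at session
   setup.  The message layout and the algorithm identifiers are
   assumptions of this document, not a defined protocol.

   #include <stdint.h>

   /* Hypothetical capability advertisement.  The receiver intersects
    * the advertised list with its own and replies with the chosen
    * algorithm; if the lists do not intersect, both sides fall back
    * to the default congestion control behavior. */
   struct opencc_capability {
       uint8_t  version;
       uint8_t  algo_count;   /* number of valid entries below     */
       uint16_t reserved;
       uint32_t algo_ids[8];  /* locally administered identifiers  */
   };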








Zhuang, et al.             Expires May 6, 2020                  [Page 7]

Internet-Draft           open congestion control           November 2019


7.2.  Negotiate the congestion control parameters

   The parameters might be set by administrators to match their
   traffic patterns and network environments, or be derived from
   application requirements.  Hence, these parameters might change
   after the session is set up.  As such, hosts should be able to
   renegotiate their parameters when they change, or be configured so
   that they remain consistent.
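
   A parameter update could, for example, be carried as a simple type-
   length-value record so that a subset of parameters can be
   renegotiated without renegotiating the algorithm itself.  The
   following sketch is illustrative only; the parameter identifiers
   are assumptions of this document.

   #include <stdint.h>

   /* Hypothetical parameter update record (e.g. an ECN marking
    * threshold or a minimum rate). */
   struct opencc_param_tlv {
       uint16_t type;     /* locally administered parameter id     */
       uint16_t length;   /* length of value[] in bytes            */
       uint8_t  value[];  /* parameter value, algorithm specific   */
   };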

8.  Security Considerations

   TBD

9.  Manageability Consideration

   TBD

10.  IANA Considerations

   This document has no IANA actions.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

11.2.  Informative References

   [BBR]      Cardwell, N., Cheng, Y., and S. Yeganeh, "BBR Congestion
              Control", <https://tools.ietf.org/html/draft-cardwell-
              iccrg-bbr-congestion-control-00>.

   [DCQCN]    "Congestion Control for Large-Scale RDMA Deployments.",
              <https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/
              p523.pdf>.

   [NVMe-oF]  "NVMe over Fabrics", <https://nvmexpress.org/wp-
              content/uploads/NVMe_Over_Fabrics.pdf>.





Zhuang, et al.             Expires May 6, 2020                  [Page 8]

Internet-Draft           open congestion control           November 2019


   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
              <https://www.rfc-editor.org/info/rfc5681>.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
              RFC 8312, DOI 10.17487/RFC8312, February 2018,
              <https://www.rfc-editor.org/info/rfc8312>.

   [RoCEv2]   "Infiniband Trade Association.  InfiniBandTM Architecture
              Specification Volume 1 and Volume 2.",
              <https://cw.infinibandta.org/document/dl/7781>.

Appendix A.  Experiments

   This section describes two sets of experiments studying the
   performance of congestion control algorithms in a simplified
   storage network.  The first set applies one algorithm to both query
   and backup traffic, while the second set studies the performance
   when different algorithms are used for the query traffic and the
   backup traffic.  The metrics are the throughput of the backup
   traffic, the average completion time of the query traffic, and the
   95th percentile query completion time.
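
   For clarity, the completion time metrics can be computed as in the
   following sketch.  The nearest-rank definition of the 95th
   percentile is one common choice; the exact measurement tooling used
   in the experiments is not specified here.

   #include <math.h>
   #include <stdlib.h>

   static int cmp_double(const void *a, const void *b)
   {
       double x = *(const double *)a, y = *(const double *)b;
       return (x > y) - (x < y);
   }

   /* Average query completion time over n samples (milliseconds). */
   static double avg_ms(const double *t, size_t n)
   {
       double sum = 0.0;
       for (size_t i = 0; i < n; i++)
           sum += t[i];
       return sum / (double)n;
   }

   /* 95th percentile by the nearest-rank method (sorts in place). */
   static double p95_ms(double *t, size_t n)
   {
       qsort(t, n, sizeof(t[0]), cmp_double);
       size_t rank = (size_t)ceil(0.95 * (double)n);  /* 1-based */
       return t[rank - 1];
   }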


        +----------+           +----------+
        | Database |           | Database |
        |    S3    ....     ....    S4    |
        +---+------+  .     .  +------+---+
            |         .     .         |
            |         .query.         |
            |         .     .         |
    backup  |         .     .         | backup
            |   .............         |
            |   .     .............   |
            |   .                 .   |
        +---V---V--+           +--V---V---+
        | Database <-----------> Database |
        |    S1    |  backup   |    S2    |
        +----------+           +----------+
   Figure 2. Simplified storage network topology





Zhuang, et al.             Expires May 6, 2020                  [Page 9]

Internet-Draft           open congestion control           November 2019


   All experiments use a full implementation of the congestion control
   algorithms (Reno, CUBIC, DCTCP and BBR) on NICs.  The testbed
   consists of 4 servers connected to one switch; each server has a
   10Gbps NIC connected to a 10Gbps port on the switch.  However, we
   limit all ports to 1Gbps to create congestion points.  In the
   experiments, the database server S1 receives backup traffic from
   both S3 and S2 and one query flow from S4.  The server S2 receives
   backup traffic from S1 and S4 and one query flow from S3.  Thus
   three flows are transmitted to S1 through one egress port on the
   switch, which can cause congestion.

   In the first experiment set, we test one algorithm for both traffic
   types.  The results are shown below in Table 1.


 +----------------+-----------+-----------+-----------+-----------+
 |                |   reno    |   cubic   |    bbr    |   dctcp   |
 +----------------+-----------+-----------+-----------+-----------+
 | Throughput MB/s|   64.92   |   65.97   |   75.25   |   70.06   |
 +----------------+-----------+-----------+-----------+-----------+
 |  Avg. comp ms  |  821.61   |  858.05   |   85.68   |   99.90   |
 +----------------+-----------+-----------+-----------+-----------+
 |  95% comp  ms  |  894.65   |  911.23   |  231.75   |  273.92   |
 +----------------+-----------+-----------+-----------+-----------+
 Table 1. Performance with one CC for both query and backup traffic

   As we can see, the average completion time with BBR and DCTCP is
   about 10 times better than with Reno and CUBIC.  BBR is the best at
   maintaining high throughput.

   In the second set, we test all combinations of algorithms for the
   two traffic types.

   1.  Reno for query traffic

   reno@query
   +----------------+-----------+-----------+-----------+-----------+
   |    @backup     |   cubic   |    bbr    |   dctcp   |    reno   |
   +----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s|   66.00   |   76.19   |   64.00   |   64.92   |
   +----------------+-----------+-----------+-----------+-----------+
   |  Avg. comp ms  |  859.61   |   81.87   |   18.38   |  821.61   |
   +----------------+-----------+-----------+-----------+-----------+
   |  95% comp  ms  |  917.80   |  149.88   |   20.38   |  894.65   |
   +----------------+-----------+-----------+-----------+-----------+

   Table 2. reno @ query and cubic, bbr, dctcp @ backup




Zhuang, et al.             Expires May 6, 2020                 [Page 10]

Internet-Draft           open congestion control           November 2019


   It shows that, with Reno used for the query traffic, BBR for the
   backup traffic gets better throughput than the other candidates.
   However, DCTCP for the backup traffic gets much better average and
   95th percentile completion times, almost 6 times better than those
   of BBR, even though its throughput is lower than BBR's.  The reason
   might be that BBR does not take packet loss and congestion level
   into account, which can cause many retransmissions.  In this test
   set, DCTCP for the backup traffic gives the better performance.

   2.  Cubic for query traffic

    cubic@query
    +----------------+-----------+-----------+-----------+-----------+
    |    @backup     |   reno    |    bbr    |   dctcp   |   cubic   |
    +----------------+-----------+-----------+-----------+-----------+
    | Throughput MB/s|   64.92   |   75.02   |   65.29   |   65.97   |
    +----------------+-----------+-----------+-----------+-----------+
    |  Avg. comp ms  |  819.23   |   83.50   |   18.42   |  858.05   |
    +----------------+-----------+-----------+-----------+-----------+
    |  95% comp  ms  |  902.66   |  170.96   |   20.99   |  911.23   |
    +----------------+-----------+-----------+-----------+-----------+
   Table 3. cubic @ query and reno, bbr, dctcp @ backup

   The results with CUBIC for the query traffic are similar to those
   with Reno.  Even with lower throughput, DCTCP is almost 6 times
   better than BBR in average and 95th percentile completion time, and
   nearly 10 times better than Reno and CUBIC.

   3.  BBR for query traffic

   bbr@query
   +----------------+-----------+-----------+-----------+-----------+
   |    @backup     |   reno    |   cubic   |   dctcp   |    bbr    |
   +----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s|   64.28   |   66.61   |   65.29   |   75.25   |
   +----------------+-----------+-----------+-----------+-----------+
   |  Avg. comp ms  |  866.05   |  895.12   |   18.49   |   85.68   |
   +----------------+-----------+-----------+-----------+-----------+
   |  95% comp  ms  |  925.06   |  967.67   |   20.86   |  231.75   |
   +----------------+-----------+-----------+-----------+-----------+
    Table 4. bbr @ query and reno, cubic, dctcp @ backup

   The results still match those obtained with Reno and CUBIC.  In the
   last two columns, DCTCP for the backup traffic shows better
   performance even compared with BBR used for the backup traffic.
   This indicates that bbr @ query with dctcp @ backup is better than
   bbr @ query and backup.

   4.  DCTCP for query traffic



Zhuang, et al.             Expires May 6, 2020                 [Page 11]

Internet-Draft           open congestion control           November 2019


   dctcp@query
   +----------------+-----------+-----------+-----------+-----------+
   |    @backup     |   reno    |   cubic   |    bbr    |   dctcp   |
   +----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s|   60.93   |   64.49   |   76.15   |   70.06   |
   +----------------+-----------+-----------+-----------+-----------+
    |  Avg. comp ms  | 2817.53   | 3077.20   |  816.45   |   99.90   |
   +----------------+-----------+-----------+-----------+-----------+
   |  95% comp  ms  | 3448.53   | 3639.94   | 2362.72   |  273.92   |
   +----------------+-----------+-----------+-----------+-----------+
    Table 5. dctcp @ query and reno, cubic, bbr @ backup

   The results for dctcp @ query look worse than the others in
   completion time, since we did not introduce L4S in the experiments,
   which means DCTCP backs off most of the time when congestion
   happens and the query traffic suffers long latency.  The best
   performance in this test set is with dctcp @ backup, in which case
   both traffic types use the same mechanism to back off.  However,
   the numbers are still worse than when another algorithm is used for
   the query traffic and DCTCP for the backup traffic.

Authors' Addresses

   Yan Zhuang
   Huawei Technologies Co., Ltd.
   101 Software Avenue, Yuhua District
   Nanjing, Jiangsu  210012
   China

   Email: zhuangyan.zhuang@huawei.com


   Wenhao Sun
   Huawei Technologies Co., Ltd.
   101 Software Avenue, Yuhua District
   Nanjing, Jiangsu  210012
   China

   Email: sam.sunwenhao@huawei.com


   Long Yan
   Huawei Technologies Co., Ltd.
   101 Software Avenue, Yuhua District
   Nanjing, Jiangsu  210012
   China

   Email: yanlong20@huawei.com



Zhuang, et al.             Expires May 6, 2020                 [Page 12]