TAPS BoF                                                     Lingli Deng
Internet-Draft                                              China Mobile
Expires: September 3, 2014                                 March 3, 2014


       Considerations on Transport Services API for Data Centers
                     draft-deng-taps-datacenter-01

Abstract

   Within a data center, traffic patterns and performance goals for the
   transport layer differ from those commonly found on the Internet.
   This draft discusses use cases for transport services APIs from the
   perspective of an application running in a data center environment,
   and proposes potential requirements for the design of such APIs.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1 Introduction
   2 Terminology
   3 Usecases
      3.1 Web Search
      3.2 VM Related Traffic
      3.3 Application Priorities
      3.4 Access Type Differentiation
      3.5 Delay Tolerant Traffic
   4 Transport Optimization in DC
      4.1 Performance degradation in DC
         4.1.1 Incast Collapse
         4.1.2 Long tail of RTT
         4.1.3 Buffer Pressure
      4.2 Transport Optimization Goals
   5 DC Transport API Considerations
      5.1 Information Flow From The Above
      5.2 Information Flow From The Bottom
   6 Security Considerations
   7 Acknowledgements
   8 IANA Considerations
   9 References
   Authors' Addresses
1 Introduction

   It is observed that within a data center, traffic patterns and
   performance goals for the transport layer differ from those commonly
   found on the Internet.  This draft discusses use cases for transport
   services APIs from the perspective of an application running in a
   data center environment, and proposes potential requirements for the
   design of such APIs.

2 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   DC: Data Center, a facility used to house computer systems and
   associated components, such as telecommunications and storage
   systems.

   ToR: Top of Rack switch, which usually sits on top of a rack of
   servers and serves as the entrance to the other parts of the data
   center network, as well as interconnecting the local servers within
   the rack.

   VM: Virtual Machine, a software implementation of a machine (i.e., a
   computer) that executes programs like a physical machine.

   VM Migration: Virtual Machine Migration, the process of moving a
   running virtual machine or application between different physical
   machines.

   NIC: Network Interface Controller, a computer hardware component
   that connects a computer to a computer network.

   DCB: Data Center Bridging, a set of enhancements to Ethernet local
   area networks for use in data center environments, such as lossless
   Ethernet.

3 Usecases

   This section presents use cases for optimized data delivery within a
   data center.

3.1 Web Search

   Within a production data center hosting a web search engine, three
   types of TCP traffic have been identified: (1) highly delay-
   sensitive short flows resulting from the distributed computing model
   pervasively employed for interactive Internet applications (web
   search/social networking); (2) highly delay-sensitive short flows
   for cluster control/management; and (3) delay-tolerant background
   flows for backup/synchronization with considerably large data
   volume.

3.2 VM Related Traffic

   In virtualized data centers, to cope with the reliability concerns
   arising from relatively unreliable commodity hardware platforms, it
   is common practice to keep several identical VM instances running on
   different physical servers as each other's backup.  In such cases,
   TCP flows for VM backup or migration, although considerably larger
   in data volume and longer in duration than typical user traffic, are
   also delay sensitive.

3.3 Application Priorities

   For a data center accommodating multiple applications, one would
   certainly prefer differentiated resource provisioning in case of
   congestion, according to the DC operator's provisioning policy or
   the applications' own characteristics.

   For instance, if the physical resources in a data center are shared
   between a delay-sensitive web search engine and a relatively delay-
   tolerant document/music sharing application, both applications'
   traffic is multiplexed on the internal DC network and shares the
   links from load balancer to servers and from servers to the
   database.

3.4 Access Type Differentiation

   Given the various access types for a specific application, the DC
   operator may want to enforce different QoS policies for selected
   groups of users, according to their access type.

   For instance, if the service provider is currently focusing on the
   mobile market, it could prioritize mobile traffic over fixed
   traffic.  Facing potential competing service providers, one may also
   want to prioritize direct traffic from its own application over that
   of third-party users.

3.5 Delay Tolerant Traffic

   Delay-tolerant traffic, including background software upgrades and
   other management traffic, such as active measurement traffic for
   performance monitoring/fault detection, should not impact any
   production traffic.

4 Transport Optimization in DC

   To fully understand why special transport services are needed for
   the DC environment as compared to the Internet, it is better to
   first look at what an optimized transport service would provide from
   the perspective of a DC application, beginning with the performance
   degradation issues that such an application faces.

4.1 Performance degradation in DC

   In particular, the following three issues are identified in the DC
   environment in terms of transport performance.

4.1.1 Incast Collapse

   For the sake of reduced CAPEX, cheap shallow-buffered ToR switches
   currently dominate in data centers and will continue to do so.
   Hence it is quite likely that the buffer space of the ToR switch in
   front of an aggregator (the server responsible for dividing a task
   into a group of subtasks and collecting responses from the relevant
   worker servers for result aggregation) is exhausted the instant that
   the workers submit their subtask results through highly synchronized
   TCP flows, resulting in consistent packet loss over the affected
   flows.  The resulting timeouts cause dramatic performance
   degradation, since the regular RTT in a data center (less than
   10 ms) is orders of magnitude smaller than the traditional TCP RTO
   configuration (200 ms).
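   To make the scale mismatch concrete, the following back-of-the-
   envelope sketch uses assumed, illustrative numbers (a 128 KB shared
   ToR output buffer, 64 KB responses per worker, a 100 us loss-free
   intra-rack RTT, and the conventional 200 ms minimum RTO); actual
   values vary per deployment.  It estimates how few synchronized
   workers suffice to overrun the buffer, and how long a single RTO-
   induced stall is relative to the loss-free RTT.

      /* Back-of-the-envelope incast illustration.  All figures are
       * assumptions chosen for illustration only; real switch buffers,
       * response sizes, RTTs and RTOs differ per deployment. */
      #include <stdio.h>

      int main(void)
      {
          const double buffer_bytes   = 128 * 1024; /* shared ToR buffer   */
          const double response_bytes = 64 * 1024;  /* per-worker response */
          const double rtt_seconds    = 100e-6;     /* intra-rack RTT      */
          const double rto_seconds    = 200e-3;     /* minimum RTO         */

          /* Workers answer in lock-step, so their bursts arrive at the
           * aggregator's ToR port almost simultaneously.  Ignoring the
           * small drain during the burst, overflow starts once the
           * combined burst exceeds the buffer. */
          int workers_to_overflow = (int)(buffer_bytes / response_bytes) + 1;

          /* A single RTO-triggered stall dwarfs the loss-free exchange. */
          double stall_vs_rtt = rto_seconds / rtt_seconds;

          printf("synchronized workers needed to overflow: %d\n",
                 workers_to_overflow);
          printf("one RTO stall is roughly %.0f times the RTT\n",
                 stall_vs_rtt);
          return 0;
      }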
4.1.2 Long tail of RTT

   Different from typical Internet traffic, the end-to-end RTT of a TCP
   flow is almost reduced to zero in the absence of queuing, as the
   propagation delay is trivial (microseconds) over the short distance
   between any pair of servers (typically within tens of meters).  Due
   to the greedy nature of traditional TCP algorithms, the presence of
   large-volume long flows increasingly builds up queues in the buffer
   space of intermediaries along the path, resulting in considerable
   queuing delay at switches (milliseconds) for short delay-sensitive
   flows.

4.1.3 Buffer Pressure

   Another effect of long queues in the buffer space of intermediaries
   along the path is that they further reduce the buffer space actually
   available to accommodate bursty delay-sensitive short flows, even if
   those flows are not submitted at the same time.

4.2 Transport Optimization Goals

   Since both hardware and software are typically deployed and
   customized by a single DC operator, various private solutions to
   these issues have been proposed, including cross-layer and cross-
   boundary ones (requiring cooperation between network devices and end
   hosts).  In solving the above issues, these proposals aim to meet
   some of the following optimization goals:

   (1) Reduce loss/timeout occurrence: since TCP performance
   degradation is caused by packet losses/retransmission timeouts, it
   is proposed that a finer-tuned RTO configuration and a finer-grained
   timing framework could largely mitigate the resulting impact
   [Pannas].  In the meantime, there is work from the IEEE DCB family
   providing lossless Ethernet service at the link layer, which can be
   used to avoid packet loss as seen from the IP layer and has been
   demonstrated to be effective as part of a coupled solution for DC
   transport optimization [detail].

   (2) Mitigate the impact of loss/timeout: delay-based congestion
   control algorithms are expected to be more robust to packet
   losses/timeouts in mitigating the incast collapse issue [vegas].

   (3) Avoid lengthy buffer queues: as queuing delay substantially
   impacts the RTT in the DC environment, there is motivation to
   improve performance by keeping the buffer queues short [dctcp] or
   even empty [hull].  To do so, the sender may sense the queue at
   switches via explicit feedback (ECN in [dctcp]) or implicit delay
   variation (Vegas [vegas]); see the sketch after this list for the
   ECN-based approach.

   (4) Delay-prioritized buffer queuing: during resource-bounded
   periods, it is essential to make efficient use of the limited
   resources to deliver the most desirable service, rather than fair-
   sharing among all competitors and ultimately failing them all.
   Proposals have been made to allow applications to explicitly
   indicate a flow's delivery preferences (either by absolute deadline
   information [d3] or by relative priorities [detail]), in order to
   improve the overall delivery success rate.

   (5) Smooth traffic bursts: on one hand, a (distributed) application
   can be refined to introduce random offsets so as to avoid peaks of
   concurrent short-flow submission; on the other hand, random offsets
   can be introduced into the RTO backoff calculation to mitigate
   retransmission synchronization [Pannas].  Moreover, physical pacing
   at the NIC level has been proposed to counter the traffic bursts
   caused by common OS/server performance optimization techniques
   [d2tcp].
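   As an illustration of goal (3), the following minimal sketch follows
   the sender-side reaction described in [dctcp]: the sender maintains
   a running estimate (alpha) of the fraction of ECN-marked packets and
   reduces its congestion window in proportion to that estimate,
   instead of halving it on every mark as standard TCP would.  The
   function and variable names are illustrative only; a real
   implementation lives inside the transport stack, not in application
   code.

      /* Minimal sketch of a DCTCP-style sender reaction to ECN marks,
       * following the algorithm described in [dctcp].  Names are
       * illustrative; the gain g = 1/16 is the value suggested in the
       * DCTCP paper. */
      #include <stdio.h>

      #define G 0.0625                /* estimation gain g = 1/16 */

      static double alpha = 0.0;      /* running estimate of marked fraction */

      /* Called once per window of data (roughly once per RTT). */
      static double on_window_end(double cwnd, int acked_pkts, int marked_pkts)
      {
          double f = (acked_pkts > 0) ? (double)marked_pkts / acked_pkts : 0.0;

          /* alpha <- (1 - g) * alpha + g * F */
          alpha = (1.0 - G) * alpha + G * f;

          /* Cut the window in proportion to the observed congestion,
           * rather than halving it on every mark. */
          if (marked_pkts > 0)
              cwnd = cwnd * (1.0 - alpha / 2.0);

          return (cwnd > 2.0) ? cwnd : 2.0;   /* keep a minimal window */
      }

      int main(void)
      {
          /* Example: 10% of packets ECN-marked in five successive windows. */
          double cwnd = 100.0;
          for (int w = 0; w < 5; w++) {
              cwnd = on_window_end(cwnd, 100, 10);
              printf("window %d: alpha=%.3f cwnd=%.1f\n", w, alpha, cwnd);
          }
          return 0;
      }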
5 DC Transport API Considerations

   According to the above discussion, it is believed that the following
   information flows should be supported by optimized APIs between the
   application and the core transport service; a sketch of one possible
   form such an API could take is given at the end of this section.

5.1 Information Flow From The Above

   (1) Delivery related: refers to information from the application
   about its expectations on data delivery.  For example, an explicit
   performance expectation could be specified by (1.1) an absolute
   delay requirement; or (1.2) a relative priority indication.

   (2) Retransmission related: refers to information from the
   application about how the transport should deal with packet losses.
   For example, the information could include: (2.1) whether loss
   recovery is needed or not; (2.2) if so, the preferred retransmission
   timeout granularity.

   (3) Pacing related: refers to information from the application about
   the flow's suitability for pacing.  For example, the information
   could include: (3.1) the traffic duration, in case of a policy that
   paces long flows only; (3.2) a burstiness expectation.

5.2 Information Flow From The Bottom

   Congestion status: refers to information from the network device or
   the local transport layer about the congestion status of the current
   transport connection/path.
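   The sketch below expresses these information flows as socket-style
   options, purely for illustration.  The TAPS_* option names, the
   taps_set()/taps_get() entry points, and the flow handle are
   hypothetical placeholders invented for this sketch, not existing
   APIs; an actual API could equally be message- or callback-based.

      /* Hypothetical sketch of the Section 5 information flows
       * expressed as socket-style options.  None of the TAPS_* names
       * or taps_* calls below exist in any real stack; they only
       * illustrate what an application might hand down to, and read
       * back from, an optimized DC transport service. */
      #include <stdint.h>
      #include <stdio.h>

      enum {
          TAPS_DEADLINE_MS = 1,        /* (1.1) absolute delay requirement  */
          TAPS_PRIORITY,               /* (1.2) relative priority           */
          TAPS_LOSS_RECOVERY,          /* (2.1) loss recovery wanted or not */
          TAPS_RTO_GRANULARITY_US,     /* (2.2) preferred RTO granularity   */
          TAPS_EXPECTED_DURATION_MS,   /* (3.1) expected traffic duration   */
          TAPS_BURSTINESS,             /* (3.2) burstiness expectation      */
          TAPS_CONGESTION_STATUS       /* (5.2) status reported upward      */
      };

      /* Stub entry points standing in for a real transport service;
       * here they only print the hint so the sketch compiles and runs. */
      static int taps_set(int flow, int opt, uint32_t val)
      {
          printf("flow %d: set option %d = %u\n", flow, opt, val);
          return 0;
      }

      static int taps_get(int flow, int opt, uint32_t *val)
      {
          *val = 0;                    /* pretend: no congestion observed */
          printf("flow %d: get option %d -> %u\n", flow, opt, *val);
          return 0;
      }

      int main(void)
      {
          int flow = 1;                /* placeholder flow/socket handle */
          uint32_t congested;

          /* Information flow from above: a latency-critical short
           * response flow whose data is useless after its deadline. */
          taps_set(flow, TAPS_DEADLINE_MS, 10);          /* deliver in 10 ms  */
          taps_set(flow, TAPS_LOSS_RECOVERY, 0);         /* no retransmission */
          taps_set(flow, TAPS_RTO_GRANULARITY_US, 1000); /* 1 ms RTO ticks    */
          taps_set(flow, TAPS_EXPECTED_DURATION_MS, 1);  /* short flow        */

          /* Information flow from the bottom: query congestion status. */
          taps_get(flow, TAPS_CONGESTION_STATUS, &congested);
          return 0;
      }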
6 Security Considerations

   TBA.

7 Acknowledgements

   The authors wish to thank Zhen Cao, Hui Deng and Michael Welzl for
   providing comments, feedback, and improvement proposals on the
   document.

8 IANA Considerations

   There is no IANA action in this document.

9 References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [Pannas]   Vasudevan, V., Phanishayee, A., Shah, H., et al., "Safe
              and Effective Fine-grained TCP Retransmissions for
              Datacenter Communication", ACM SIGCOMM Computer
              Communication Review, 39(4): 303-314, 2009.

   [detail]   Zats, D., Das, T., Mohan, P., et al., "DeTail: Reducing
              the Flow Completion Time Tail in Datacenter Networks",
              ACM SIGCOMM Computer Communication Review, 42(4):
              139-150, 2012.

   [vegas]    Lee, C., Jang, K., and Moon, S., "Reviving Delay-based
              TCP for Data Centers", Proceedings of ACM SIGCOMM 2012,
              pp. 111-112, 2012.

   [dctcp]    Alizadeh, M., Greenberg, A., Maltz, D., et al., "Data
              Center TCP (DCTCP)", ACM SIGCOMM Computer Communication
              Review, 41(4): 63-74, 2011.

   [hull]     Alizadeh, M., Kabbani, A., Edsall, T., et al., "Less Is
              More: Trading a Little Bandwidth for Ultra-low Latency in
              the Data Center", Proceedings of the 9th USENIX
              Conference on Networked Systems Design and Implementation
              (NSDI), 2012.

   [d2tcp]    Vamanan, B., Hasan, J., and Vijaykumar, T., "Deadline-
              aware Datacenter TCP (D2TCP)", ACM SIGCOMM Computer
              Communication Review, 42(4): 115-126, 2012.

   [d3]       Wilson, C., Ballani, H., Karagiannis, T., and Rowstron,
              A., "Better Never than Late: Meeting Deadlines in
              Datacenter Networks", Proceedings of ACM SIGCOMM 2011,
              2011.

Authors' Addresses

   Lingli Deng
   China Mobile

   Email: denglingli@chinamobile.com

   Zhen Cao
   China Mobile

   Email: caozhen@chinamobile.com

   Hui Deng
   China Mobile

   Email: denghui@chinamobile.com