TAPS BoF                                                     Lingli Deng
Internet-Draft                                              China Mobile
Expires: September 3, 2014                                 March 3, 2014


       Considerations on Transport Services API for Data Centers
                     draft-deng-taps-datacenter-01

Abstract

   Within a data center, traffic patterns and performance goals for the
   transport layer differ from those commonly found on the Internet.
   This draft discusses use cases for transport services APIs from the
   perspective of an application running in a data center environment,
   and proposes potential requirements for the design of such APIs.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1 Introduction
   2 Terminology
   3 Usecases
      3.1 Web Search
      3.2 VM Related Traffic
      3.3 Application Priorities
      3.4 Access Type Differentiation
      3.5 Delay Tolerant Traffic
   4 Transport Optimization in DC
      4.1 Performance degradation in DC
         4.1.1 Incast Collapse
         4.1.2 Long tail of RTT
         4.1.3 Buffer Pressure
      4.2 Transport Optimization Goals
   5 DC Transport API Considerations
      5.1 Information Flow From The Above
      5.2 Information Flow From The Bottom
   6 Security Considerations
   7 Acknowledgements
   8 IANA Considerations
   9 References
   Authors' Addresses
1 Introduction

   It is observed that within a data center, traffic patterns and
   performance goals for the transport layer differ from those commonly
   found on the Internet.  This draft discusses use cases for transport
   services APIs from the perspective of an application running in a
   data center environment, and proposes potential requirements for the
   design of such APIs.

2 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   DC: Data Center, a facility used to house computer systems and
   associated components, such as telecommunications and storage
   systems.

   ToR: Top of Rack switch, which usually sits on top of a rack of
   servers and serves as the entrance to the other parts of the data
   center network, as well as interconnecting the local servers within
   the rack.

   VM: Virtual Machine, a software implementation of a machine (i.e., a
   computer) that executes programs like a physical machine.

   VM Migration: Virtual Machine Migration, the process of moving a
   running virtual machine or application between different physical
   machines.

   NIC: Network Interface Controller, a computer hardware component
   that connects a computer to a computer network.

   DCB: Data Center Bridging, a set of enhancements to Ethernet local
   area networks for use in data center environments, such as lossless
   Ethernet.

3 Usecases

   This section presents use cases for optimized data delivery within a
   data center.

3.1 Web Search

   Within a production data center hosting a web search engine, three
   types of TCP traffic have been identified: (1) highly delay-
   sensitive short flows resulting from the distributed computing model
   pervasively employed for interactive Internet applications (web
   search/social networking); (2) highly delay-sensitive short flows
   for cluster control/management; and (3) delay-tolerant background
   flows for backup/synchronization with considerably large data
   volume.

3.2 VM Related Traffic

   In virtualized data centers, to cope with the reliability concerns
   arising from relatively unreliable commodity hardware platforms, it
   is common practice to keep several identical VM instances running on
   different physical servers as each other's backup.  In such cases,
   TCP flows for VM backup or migration, although considerably larger
   in data volume and longer in duration than typical user traffic, are
   also delay sensitive.

3.3 Application Priorities

   For a data center accommodating multiple applications, one would
   certainly prefer differentiated resource provisioning in case of
   congestion, according to the DC operator's provisioning policy or
   the applications' own characteristics.

   For instance, if the physical resources in a data center are shared
   between a delay-sensitive web search engine and a relatively delay-
   tolerant document/music sharing application, both applications'
   traffic is multiplexed on the internal DC network and shares the
   links from load balancer to servers and from servers to the
   database.

3.4 Access Type Differentiation

   Given the various access types for a specific application, the DC
   operator may want to enforce different QoS policies for selected
   groups of users, according to their access type.

   For instance, if the service provider is currently focusing on the
   mobile market, it could prioritize mobile traffic over fixed
   traffic.  Facing potential competing service providers, one may also
   want to prioritize direct traffic from its own application over that
   of third-party users.

3.5 Delay Tolerant Traffic

   Delay-tolerant traffic, including background software upgrades and
   other management traffic, such as active measurement traffic for
   performance monitoring/fault detection, should not impact any
   production traffic.

4 Transport Optimization in DC

   To fully understand why special transport services are needed for
   the DC environment as compared to the Internet, it is better to
   first look at what an optimized transport service would provide from
   the perspective of a DC application, beginning with the performance
   degradation issues that such an application faces.

4.1 Performance degradation in DC

   In particular, the following three issues are identified in the DC
   environment in terms of transport performance.

4.1.1 Incast Collapse

   For the sake of reduced CAPEX, cheap shallow-buffered ToR switches
   currently dominate in data centers and will continue to do so.
   Hence it is quite likely that the buffer space of the ToR switch in
   front of an aggregator (the server responsible for dividing a task
   into a group of subtasks and collecting responses from the relevant
   worker servers for result aggregation) is exhausted the instant that
   the workers submit their subtask results through highly synchronized
   TCP flows, resulting in consistent packet loss over the affected
   flows.  The resulting timeouts cause dramatic performance
   degradation, since the regular RTT in a data center (less than
   10 ms) is orders of magnitude smaller than the traditional TCP RTO
   configuration (200 ms).
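   To make the scale mismatch concrete, the following back-of-the-
   envelope sketch uses assumed, illustrative numbers (a 128 KB shared
   ToR output buffer, 64 KB responses per worker, a 100 us loss-free
   intra-rack RTT, and the conventional 200 ms minimum RTO); actual
   values vary per deployment.  It estimates how few synchronized
   workers suffice to overrun the buffer, and how long a single RTO-
   induced stall is relative to the loss-free RTT.

      /* Back-of-the-envelope incast illustration.  All figures are
       * assumptions chosen for illustration only; real switch buffers,
       * response sizes, RTTs and RTOs differ per deployment. */
      #include <stdio.h>

      int main(void)
      {
          const double buffer_bytes   = 128 * 1024; /* shared ToR buffer   */
          const double response_bytes = 64 * 1024;  /* per-worker response */
          const double rtt_seconds    = 100e-6;     /* intra-rack RTT      */
          const double rto_seconds    = 200e-3;     /* minimum RTO         */

          /* Workers answer in lock-step, so their bursts arrive at the
           * aggregator's ToR port almost simultaneously.  Ignoring the
           * small drain during the burst, overflow starts once the
           * combined burst exceeds the buffer. */
          int workers_to_overflow = (int)(buffer_bytes / response_bytes) + 1;

          /* A single RTO-triggered stall dwarfs the loss-free exchange. */
          double stall_vs_rtt = rto_seconds / rtt_seconds;

          printf("synchronized workers needed to overflow: %d\n",
                 workers_to_overflow);
          printf("one RTO stall is roughly %.0f times the RTT\n",
                 stall_vs_rtt);
          return 0;
      }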
4.1.2 Long tail of RTT

   Different from typical Internet traffic, the end-to-end RTT of a TCP
   flow is almost reduced to zero in the absence of queuing, as the
   propagation delay is trivial (microseconds) over the short distance
   between any pair of servers (typically within tens of meters).  Due
   to the greedy nature of traditional TCP algorithms, the presence of
   large-volume long flows increasingly builds up queues in the buffer
   space of intermediaries along the path, resulting in considerable
   queuing delay at switches (milliseconds) for short delay-sensitive
   flows.

4.1.3 Buffer Pressure

   Another effect of long queues in the buffer space of intermediaries
   along the path is that they further reduce the buffer space actually
   available to accommodate bursty delay-sensitive short flows, even if
   those flows are not submitted at the same time.

4.2 Transport Optimization Goals

   Since both hardware and software are typically deployed and
   customized by a single DC operator, various private solutions to
   these issues have been proposed, including cross-layer and cross-
   boundary ones (requiring cooperation between network devices and end
   hosts).  In solving the above issues, these proposals aim to meet
   some of the following optimization goals:

   (1) Reduce loss/timeout occurrence: since TCP performance
   degradation is caused by packet losses/retransmission timeouts, it
   is proposed that a finer-tuned RTO configuration and a finer-grained
   timing framework could largely mitigate the resulting impact
   [Pannas].  In the meantime, there is work from the IEEE DCB family
   providing lossless Ethernet service at the link layer, which can be
   used to avoid packet loss as seen from the IP layer and has been
   demonstrated to be effective as part of a coupled solution for DC
   transport optimization [detail].

   (2) Mitigate the impact of loss/timeout: delay-based congestion
   control algorithms are expected to be more robust to packet
   losses/timeouts in mitigating the incast collapse issue [vegas].

   (3) Avoid lengthy buffer queues: as queuing delay substantially
   impacts the RTT in the DC environment, there is motivation to
   improve performance by keeping the buffer queues short [dctcp] or
   even empty [hull].  To do so, the sender may sense the queue at
   switches via explicit feedback (ECN in [dctcp]) or implicit delay
   variation (Vegas [vegas]); see the sketch after this list for the
   ECN-based approach.

   (4) Delay-prioritized buffer queuing: during resource-bounded
   periods, it is essential to make efficient use of the limited
   resources to deliver the most desirable service, rather than fair-
   sharing among all competitors and ultimately failing them all.
   Proposals have been made to allow applications to explicitly
   indicate a flow's delivery preferences (either by absolute deadline
   information [d3] or by relative priorities [detail]), in order to
   improve the overall delivery success rate.

   (5) Smooth traffic bursts: on one hand, a (distributed) application
   can be refined to introduce random offsets so as to avoid peaks of
   concurrent short-flow submission; on the other hand, random offsets
   can be introduced into the RTO backoff calculation to mitigate
   retransmission synchronization [Pannas].  Moreover, physical pacing
   at the NIC level has been proposed to counter the traffic bursts
   caused by common OS/server performance optimization techniques
   [d2tcp].
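   As an illustration of goal (3), the following minimal sketch follows
   the sender-side reaction described in [dctcp]: the sender maintains
   a running estimate (alpha) of the fraction of ECN-marked packets and
   reduces its congestion window in proportion to that estimate,
   instead of halving it on every mark as standard TCP would.  The
   function and variable names are illustrative only; a real
   implementation lives inside the transport stack, not in application
   code.

      /* Minimal sketch of a DCTCP-style sender reaction to ECN marks,
       * following the algorithm described in [dctcp].  Names are
       * illustrative; the gain g = 1/16 is the value suggested in the
       * DCTCP paper. */
      #include <stdio.h>

      #define G 0.0625                /* estimation gain g = 1/16 */

      static double alpha = 0.0;      /* running estimate of marked fraction */

      /* Called once per window of data (roughly once per RTT). */
      static double on_window_end(double cwnd, int acked_pkts, int marked_pkts)
      {
          double f = (acked_pkts > 0) ? (double)marked_pkts / acked_pkts : 0.0;

          /* alpha <- (1 - g) * alpha + g * F */
          alpha = (1.0 - G) * alpha + G * f;

          /* Cut the window in proportion to the observed congestion,
           * rather than halving it on every mark. */
          if (marked_pkts > 0)
              cwnd = cwnd * (1.0 - alpha / 2.0);

          return (cwnd > 2.0) ? cwnd : 2.0;   /* keep a minimal window */
      }

      int main(void)
      {
          /* Example: 10% of packets ECN-marked in five successive windows. */
          double cwnd = 100.0;
          for (int w = 0; w < 5; w++) {
              cwnd = on_window_end(cwnd, 100, 10);
              printf("window %d: alpha=%.3f cwnd=%.1f\n", w, alpha, cwnd);
          }
          return 0;
      }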
5 DC Transport API Considerations

   According to the above discussion, it is believed that the following
   information flows should be supported by optimized APIs between the
   application and the core transport service; a sketch of one possible
   form such an API could take is given at the end of this section.

5.1 Information Flow From The Above

   (1) Delivery related: refers to information from the application
   about its expectations on data delivery.  For example, an explicit
   performance expectation could be specified by (1.1) an absolute
   delay requirement; or (1.2) a relative priority indication.

   (2) Retransmission related: refers to information from the
   application about how the transport should deal with packet losses.
   For example, the information could include: (2.1) whether loss
   recovery is needed or not; (2.2) if so, the preferred retransmission
   timeout granularity.

   (3) Pacing related: refers to information from the application about
   the flow's suitability for pacing.  For example, the information
   could include: (3.1) the traffic duration, in case of a policy that
   paces long flows only; (3.2) a burstiness expectation.

5.2 Information Flow From The Bottom

   Congestion status: refers to information from the network device or
   the local transport layer about the congestion status of the current
   transport connection/path.
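   The sketch below expresses these information flows as socket-style
   options, purely for illustration.  The TAPS_* option names, the
   taps_set()/taps_get() entry points, and the flow handle are
   hypothetical placeholders invented for this sketch, not existing
   APIs; an actual API could equally be message- or callback-based.

      /* Hypothetical sketch of the Section 5 information flows
       * expressed as socket-style options.  None of the TAPS_* names
       * or taps_* calls below exist in any real stack; they only
       * illustrate what an application might hand down to, and read
       * back from, an optimized DC transport service. */
      #include <stdint.h>
      #include <stdio.h>

      enum {
          TAPS_DEADLINE_MS = 1,        /* (1.1) absolute delay requirement  */
          TAPS_PRIORITY,               /* (1.2) relative priority           */
          TAPS_LOSS_RECOVERY,          /* (2.1) loss recovery wanted or not */
          TAPS_RTO_GRANULARITY_US,     /* (2.2) preferred RTO granularity   */
          TAPS_EXPECTED_DURATION_MS,   /* (3.1) expected traffic duration   */
          TAPS_BURSTINESS,             /* (3.2) burstiness expectation      */
          TAPS_CONGESTION_STATUS       /* (5.2) status reported upward      */
      };

      /* Stub entry points standing in for a real transport service;
       * here they only print the hint so the sketch compiles and runs. */
      static int taps_set(int flow, int opt, uint32_t val)
      {
          printf("flow %d: set option %d = %u\n", flow, opt, val);
          return 0;
      }

      static int taps_get(int flow, int opt, uint32_t *val)
      {
          *val = 0;                    /* pretend: no congestion observed */
          printf("flow %d: get option %d -> %u\n", flow, opt, *val);
          return 0;
      }

      int main(void)
      {
          int flow = 1;                /* placeholder flow/socket handle */
          uint32_t congested;

          /* Information flow from above: a latency-critical short
           * response flow whose data is useless after its deadline. */
          taps_set(flow, TAPS_DEADLINE_MS, 10);          /* deliver in 10 ms  */
          taps_set(flow, TAPS_LOSS_RECOVERY, 0);         /* no retransmission */
          taps_set(flow, TAPS_RTO_GRANULARITY_US, 1000); /* 1 ms RTO ticks    */
          taps_set(flow, TAPS_EXPECTED_DURATION_MS, 1);  /* short flow        */

          /* Information flow from the bottom: query congestion status. */
          taps_get(flow, TAPS_CONGESTION_STATUS, &congested);
          return 0;
      }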
6 Security Considerations

   TBA.

7 Acknowledgements

   The authors wish to thank Zhen Cao, Hui Deng and Michael Welzl for
   providing comments, feedback, and improvement proposals on the
   document.

8 IANA Considerations

   There is no IANA action in this document.

9 References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [Pannas]   Vasudevan, V., Phanishayee, A., Shah, H., et al., "Safe
              and Effective Fine-grained TCP Retransmissions for
              Datacenter Communication", ACM SIGCOMM Computer
              Communication Review, 39(4): 303-314, 2009.

   [detail]   Zats, D., Das, T., Mohan, P., et al., "DeTail: Reducing
              the Flow Completion Time Tail in Datacenter Networks",
              ACM SIGCOMM Computer Communication Review, 42(4):
              139-150, 2012.

   [vegas]    Lee, C., Jang, K., and Moon, S., "Reviving Delay-based
              TCP for Data Centers", Proceedings of ACM SIGCOMM 2012,
              pp. 111-112, 2012.

   [dctcp]    Alizadeh, M., Greenberg, A., Maltz, D., et al., "Data
              Center TCP (DCTCP)", ACM SIGCOMM Computer Communication
              Review, 41(4): 63-74, 2011.

   [hull]     Alizadeh, M., Kabbani, A., Edsall, T., et al., "Less Is
              More: Trading a Little Bandwidth for Ultra-low Latency in
              the Data Center", Proceedings of the 9th USENIX
              Conference on Networked Systems Design and Implementation
              (NSDI), 2012.

   [d2tcp]    Vamanan, B., Hasan, J., and Vijaykumar, T., "Deadline-
              aware Datacenter TCP (D2TCP)", ACM SIGCOMM Computer
              Communication Review, 42(4): 115-126, 2012.

   [d3]       Wilson, C., Ballani, H., Karagiannis, T., and Rowstron,
              A., "Better Never than Late: Meeting Deadlines in
              Datacenter Networks", Proceedings of ACM SIGCOMM 2011,
              2011.

Authors' Addresses

   Lingli Deng
   China Mobile

   Email: denglingli@chinamobile.com

   Zhen Cao
   China Mobile

   Email: caozhen@chinamobile.com

   Hui Deng
   China Mobile

   Email: denghui@chinamobile.com