Network Working Group                                          H. Naderi
Internet-Draft                                         B. Carpenter, Ed.
Intended status: Informational                         Univ. of Auckland
Expires: October 24, 2015                                 April 22, 2015


                   Experience with IPv6 path probing
                      draft-naderi-ipv6-probing-01

Abstract

   This document reports on experience and simulations of dynamic
   probing of alternate paths between two IPv6 hosts when network
   failures occur. Two models for such probing were investigated: the
   SHIM6 REAchability Protocol (REAP) and the Multipath Transmission
   Control Protocol (MPTCP). The motivation for this document is to
   identify some aspects of path probing at large or very large scale
   that may be broadly relevant to future protocol design.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 24, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Results for SHIM6 and REAP
     2.1.  Experiments over the Internet
     2.2.  Lab Experiments
     2.3.  Large scale simulation
   3.  Results for MPTCP
   4.  Operational issues
   5.  Implications for future designs
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  Change log [RFC Editor: Please remove]
   10. Informative References
   Authors' Addresses

1.  Introduction

   A common situation in the Internet today is that a host trying to
   contact another host has a choice of IP addresses for one or both
   ends of the communication. Multiple addresses are expected to be
   quite common for IPv6 hosts [RFC2460].

   Some approaches to this situation envisage either switching paths
   during the course of the communication or using multiple paths in
   parallel. Examples include "Happy Eyeballs" [RFC6555] which tries
   alternative paths at the start, SHIM6 [RFC5533] and Stream Control
   Transmission Protocol (SCTP) [RFC4960] which change paths when there
   is a failure, and Multipath TCP (MPTCP) [RFC6824] which shares the
   paths dynamically.
   Some of these methods involve active path probing to choose the best
   one. SHIM6 probes all available paths using the REAchability
   Protocol (REAP) [RFC5534] when the current path fails, and MPTCP
   effectively probes all paths continuously, and shifts load according
   to the results.

   In this document we summarise results and observations from SHIM6
   and MPTCP operated or simulated at large scale. These observations
   may be of help in designing future path probing mechanisms. In
   particular, we are interested in minimising both the time taken to
   recover to the maximum possible throughput after a path failure, and
   the amount of overhead traffic caused by the probing process.

   In summary, we ran a series of SHIM6 experiments, each including 250
   path failures, between Auckland and Dublin, measuring the time and
   overhead traffic for each instance of path probing and recovery.
   Then we repeated essentially the same experiment in the laboratory
   in Auckland (i.e., with negligible RTT instead of round-the-world
   RTT). Then we built a Stochastic Activity Network (SAN) simulation
   model of the same scenarios, and validated it by comparison with the
   experimental results. Finally we used this model to simulate path
   failure and recovery using REAP at very large scale (10,000
   simultaneous sessions on a single site experiencing path failure).
   Both TCP and DCCP [RFC4340] were used for the transport layer, with
   a simple application sending meaningless data in one direction only.

   This was followed by roughly equivalent simulations of recovery from
   path failure for MPTCP sessions. In this case we validated the SAN
   model by comparison with a completely different MPTCP simulator
   developed elsewhere [Wischik10].

   One advantage of the SAN model is that there are SAN analysis
   software tools which allow very large scale simulations.
   Another is that it makes it relatively easy to experiment with
   variations of the protocol itself, so we tested the impact of
   certain protocol changes. However, unlike conventional network
   simulation tools, the user has to program a complete protocol
   behaviour model. We used the Moebius tool [Moebius].

   Details of the experiments and results have been described in two
   papers [Naderi10] [Naderi14b] and in H. Naderi's thesis [Naderi14a].
   This document limits itself to outlining the results and their
   implications for the design of path probing mechanisms in the
   Internet.

2.  Results for SHIM6 and REAP

2.1.  Experiments over the Internet

   We set up a test environment which enabled us to run a set of
   experiments over the Internet with the LinShim6 implementation of
   SHIM6 [Barre08]. We used two SHIM6-enabled multi-addressed hosts,
   located in the University of Auckland (New Zealand) and Waterford
   Institute of Technology (Dublin, Ireland). Each host was equipped
   with two network interface cards and configured with two prefixes
   from two different providers. The SHIM6 host in Auckland was
   connected to a Linux machine configured as an IPv6 router. This
   router simulated link failures for the experiments.

   Source Address Dependent Routing (SADR) is necessary for effective
   use of SHIM6. With host-centric solutions such as SHIM6, it is the
   host that decides which source and destination addresses to use.
   Without SADR, or a similar routing mechanism, packets might be
   forwarded to the wrong provider and dropped because of ingress
   filtering according to BCP 38 [RFC2827] [RFC3704]. Unfortunately, we
   could not convince the university network administrators to enable
   SADR on the Auckland University edge router.
   To run the experiments, they agreed to add static routes to the edge
   router's routing table, to forward packets destined to the host in
   Dublin through different providers according to their destination
   addresses. Therefore, only two of the four possible address pairs
   could work. To resolve this issue, we changed LinShim6 to shuffle
   the list of address pairs before starting the exploration process,
   so that the working address pair could appear at any position in the
   list and thus create different recovery cases. This configuration
   enabled us to run experiments with four address pairs over the
   Internet.

   For each experiment, we artificially created 250 failures and for
   each case measured the REAP exploration time (EP), the number of
   sent probes (SP) and received probes (RP), and the application
   recovery time (ART).

   Comparing results from experiments with TCP and DCCP shows that when
   DCCP is employed, EP, SP and RP are larger than when TCP is used.
   The main reason for this is that DCCP employs delayed
   acknowledgement. It sends ACKs every RTT (300 ms), while in the case
   of TCP they are sent more frequently (less than 100 ms apart). Since
   the RTT is long, the communications look different from REAP's
   viewpoint although the behaviour of the application is the same in
   both experiments. Since TCP sends ACKs faster, REAP treats it more
   like a bi-directional communication while DCCP communication is
   treated more like uni-directional. As a result, in the DCCP
   experiment, the sender always detects the failure first and then
   reports it to the receiver, while in the TCP experiment both sides
   detect failure and start exploration almost at the same time. In
   other words, in the case of TCP, exploration is performed in
   parallel on both sides, takes less time and generates less traffic.
   This result also shows that the efficiency of solutions such as
   SHIM6, which are implemented inside the protocol stack, may also be
   affected by the behaviour of the other layers of the stack.

   We also observed some signs of probe loss in the results. Probe
   losses can affect EP, SP, RP and ART. When a probe is lost, it might
   cause the exploration process to go to a second round, and then an
   exponential backoff algorithm causes the exploration process to take
   longer and generate more traffic.

2.2.  Lab Experiments

   We repeated similar experiments in the lab. The main difference was
   the RTT, which was much smaller (0.3 ms) than in the Internet
   experiments. We set up two SHIM6 hosts in the lab, each equipped
   with four network interfaces. Thus, in addition to experiments with
   four address pairs (similar to the Internet experiments), we could
   run experiments with 9 and 16 address pairs as well.

   In the lab, we got similar results from the TCP and DCCP
   experiments. Since the RTT is small, DCCP sends ACKs faster, and
   therefore there is no difference from REAP's viewpoint.

   Probe losses are observable in the lab experiments too. Probe loss
   causes REAP to go to a second round of scanning the list of address
   pairs, which leads to sending more probes and also to a longer
   exploration time.

   Experiments with 16 address pairs fail when the working address pair
   is located at or close to the end of the list of address pairs. REAP
   employs exponential backoff after sending its initial probes, to
   avoid generating large bursts of traffic during exploration. For 16
   address pairs, this delay sometimes causes the connection to time
   out and stop the experiment.

   In some cases, SHIM6 removes the context without finding the new
   address pair.
   In such cases it seems that packet losses cause the exploration
   process to go to a second round of exploration, and the resulting
   longer delays cause SHIM6 to actually stop exploration and remove
   the context.

2.3.  Large scale simulation

   To study the behaviour of REAP in a very large scale network (e.g.,
   an enterprise network), we built a simulation model of REAP and
   conducted some experiments which simulated a link failure event in a
   network with 10,000 simultaneously active SHIM6-monitored
   communications. The aim of the experiments was to see how REAP
   reacts to path failures in a large SHIM6-enabled multihomed network.
   In our practical tests, nine address pairs seemed to be the limit,
   but we have included larger numbers in our simulations to obtain a
   clearer view of REAP's behaviour.

   We focused on REAP recovery time and probe traffic as two important
   performance parameters. REAP recovery time is the time that REAP
   takes to detect the failure and find a new working address pair.
   REAP traffic is the traffic which is generated by REAP itself during
   its exploration process.

   We measured average and total REAP recovery time for different
   numbers of address pairs for 10,000 instances of REAP. We define
   total REAP recovery time as the recovery time for the whole site,
   i.e., the time between the occurrence of the failure and the
   recovery of the last context. The average recovery time is
   calculated by dividing the sum of recovery times for REAP instances
   by the number of REAP instances. It should be noted that recovery
   time includes failure detection and address exploration times. A
   typical average recovery time for 4 address pairs is 10 to 12
   seconds.

   The results show that the average and maximum recovery time increase
   when the number of address pairs is increased.
   The relationship is not linear because REAP uses an exponential
   backoff algorithm for increasing the time interval between probes.
   As a result, REAP shows poor performance when the number of address
   pairs exceeds 9, for example taking more than 100 seconds to recover
   with 16 address pairs.

   We also measured the average and total number of probes sent during
   the address exploration process in the experiments. The results show
   that the number of sent probes grows linearly with the number of
   address pairs. They also show that a large quantity of probes is
   sent at the start of exploration. For example, in the case of four
   address pairs, 93% of the probes, and in the case of 25 address
   pairs, 34% of the probes, are sent during the first 10 seconds. The
   reason is that all contexts detect failure within 10 seconds and
   start exploration by sending initial probes (the first four probes,
   which are sent in two seconds). After that, there are some intervals
   when very few probes are sent. This can be seen more clearly in the
   experiments with more address pairs, e.g., 16 or 25 address pairs.
   This means that for some SHIM6 contexts the time interval between
   probes is large, because of the exponential backoff, so REAP
   instances have to wait for a long time before probing the next
   address pair. Some connections might be dropped by the transport or
   application layer before REAP can recover them. For example, in the
   case of 25 address pairs, 50% of contexts need more than five
   minutes to recover.

   Although the peak of the REAP traffic is generated in the first 10
   seconds (before employing the exponential backoff algorithm), our
   results show that this traffic is small compared to normal traffic
   for a large network, and cannot cause a major problem. For example,
   in the case of 25 address pairs, about 4800 probes per second are
   sent during the first 10 seconds of the exploration process, which
   is the peak of the traffic.
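
   The probe timing behind these observations can be illustrated with a
   short Python sketch. It is not taken from LinShim6 or from the
   simulation model. It assumes four initial probes sent 0.5 seconds
   apart (consistent with the two seconds mentioned above), after which
   the interval doubles for each further probe up to a cap of 60
   seconds. The doubling rule, the cap and the function name are
   assumptions made for illustration only, and failure detection time
   is ignored.

```python
def probe_send_times(n_probes, initial_probes=4, initial_interval=0.5,
                     max_interval=60.0):
    """Return the send time (seconds after exploration starts) of each probe.

    Assumed schedule: the first `initial_probes` probes are spaced
    `initial_interval` apart; after that the interval doubles for every
    further probe, capped at `max_interval`.
    """
    times = []
    now = 0.0
    interval = initial_interval
    for index in range(n_probes):
        times.append(now)
        if index + 1 >= initial_probes:
            # Exponential backoff starts once the initial probes are used up.
            interval = min(interval * 2.0, max_interval)
        now += interval
    return times


if __name__ == "__main__":
    for pairs in (4, 9, 16, 25):
        schedule = probe_send_times(pairs)
        early = sum(1 for t in schedule if t < 10.0)
        print(f"{pairs:2d} address pairs: last probe at {schedule[-1]:7.1f} s, "
              f"{early} probes in the first 10 s")
```

   Under these assumptions a single context sends seven probes in the
   first 10 seconds, and once the cap is reached later probes are a
   full minute apart, which illustrates why a working address pair near
   the end of a long list can take minutes to reach.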
   Every probe in the first 10 seconds carries at most seven address
   pairs: four initial address pairs and three more after employing
   exponential backoff. Each probe needs 72 bytes for the fixed part
   and 40 bytes for each address pair, so the average probe size in the
   first 10 seconds is 232 bytes (72 + 4 x 40, i.e., four address pairs
   per probe on average). As a result, a load of 4800 probes per second
   occupies only about 1.1 MB/s (4800 x 232 bytes) of the site's
   available link capacity. Large sites usually have high bandwidth
   links to the Internet and this amount of traffic does not cause a
   significant problem for them. In any case this traffic will occur at
   a time when normal traffic from the same sessions has been
   interrupted.

   We also tried two changes to REAP to improve recovery time:
   increasing the number of initial probes, and sending initial probes
   in parallel. In both cases, we also measured the probe traffic. The
   results showed that those modifications improved recovery time while
   their effect on the traffic was small. For example, in the case of
   nine address pairs, increasing the number of initial probes from
   four to five caused about a 6.5% increase in traffic in the first 10
   seconds of the recovery process, a 22% decrease in average recovery
   time and a 34% decrease in maximum recovery time. Sending initial
   probes in parallel, in the case of nine address pairs, caused an 11%
   decrease in average recovery time, a 4.5% decrease in maximum
   recovery time, and an 8.2% increase in traffic. In both cases, these
   modifications increased traffic, but not to a level that could not
   be handled in a large network.

3.  Results for MPTCP

   MPTCP does not use any specific mechanism for probing paths. In
   fact, every subflow runs as a TCP flow and it is the TCP congestion
   control mechanism which monitors the used path. When congestion is
   detected, the load from the congested path is transferred to other
   available paths, if they present less congestion.
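
   The way load moves away from a congested subflow can be sketched in
   Python. The update rule below (on each ACK, grow the subflow's
   window by a/w_total; on each loss, halve it, but never below one
   segment) follows the semicoupled rule as described in the MPTCP
   congestion control literature and is not code from the simulator
   used for this work. The constant a = 1, the loss pattern and all
   names are hypothetical.

```python
def semicoupled_update(windows, path, lost, a=1.0, min_window=1.0):
    """Return the congestion windows after one ACK or loss on `path`."""
    updated = list(windows)
    if lost:
        # Loss: halve only the affected subflow, keeping at least one
        # segment so that the path is still probed.
        updated[path] = max(min_window, windows[path] / 2.0)
    else:
        # ACK: the increase is coupled to the total window of all subflows.
        updated[path] = windows[path] + a / sum(windows)
    return updated


def simulate(events=2000, loss_every=20):
    """Two subflows; subflow 0 loses every `loss_every`-th packet (assumed)."""
    windows = [10.0, 10.0]
    for i in range(events):
        path = i % 2  # events alternate between the two subflows
        lost = path == 0 and (i // 2) % loss_every == loss_every - 1
        windows = semicoupled_update(windows, path, lost)
    return windows


if __name__ == "__main__":
    print(simulate())
```

   In this toy run the lossy subflow ends with the smaller window, so
   most of the load is carried by the other subflow, while the lossy
   subflow keeps sending at a low rate.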
   The MPTCP congestion control algorithm, known as SEMICOUPLED, reacts
   to congestion reports from subflows and adjusts the load on the used
   paths to achieve performance and fairness.

   TCP never sets the congestion window for a subflow to less than 1.
   Therefore, even on a highly congested path or a broken path, it
   performs the equivalent of probing by setting the congestion window
   size to 1, so that any improvements in the path can be detected.
   Expiration of the TCP retransmission timer for the subflow on a
   broken path triggers the sending of a segment once in a while,
   acting as a probe, to ensure that a recovery in the path can be
   detected. How fast this mechanism can detect an improvement in a
   broken path depends on the value of the time-out for this timer
   (RTO). The minimum value is usually set to 1 second, and each
   consecutive expiration, as occurs on a broken path, backs off the
   timer by multiplying the RTO by 2. The traffic generated by this
   mechanism is low and may be handled easily, even in a large network.

   We simulated MPTCP with up to 8 paths and with RTTs between 80 and
   150 ms, observing the expected behaviour, with the load in the
   steady state spread across the paths. When the loss rate of a path
   is higher, the throughput of that path is lower. For a given loss
   rate, a smaller RTT increases throughput on that path. However,
   total throughput increases sublinearly with more paths, due to the
   way SEMICOUPLED links the congestion windows of the various
   subflows. For example, we simulated a scenario in which the steady
   state throughput for 8 paths was only about 25% greater than for a
   single path (Figure 5.10 in [Naderi14a]). This suggests that a
   scenario with as many as 8 paths is of limited value in a reasonably
   reliable network.

   We simulated a permanent failure of a single path in a scenario with
   four paths in operation.
   As may be deduced from the previous point, the throughput recovered
   in the steady state to within a small percentage of its previous
   value. This recovery took about 6 seconds (Figure 5.15 in
   [Naderi14a]), which is significantly faster than observed with SHIM6
   due to MPTCP's effectively continuous probing.

   Simulations of temporary path failures showed that returning to the
   original steady state using all paths took a similar time.

   Finally we simulated the effect of variable loss rates on MPTCP
   performance with two paths operating. We observed that for loss
   rates varying randomly in the range up to 1%, MPTCP effectively
   maintains its steady state throughput.

4.  Operational issues

   Many if not most site border firewalls today drop packets containing
   the SHIM6 extension header. In our Internet experiments we had to
   bypass the site firewall at both ends. This issue is discussed in
   [RFC7045].

   Source Address Dependent Routing (SADR) is necessary for effective
   use of multiple paths. Without it, packets may be sent to the wrong
   exit router, or to an ISP that will immediately discard them due to
   ingress filtering. With ingress filtering in place, packets with a
   given source address may only be sent via an ISP that accepts
   packets from that source address. If this is not taken correctly
   into account by the source host and by the local routing
   configuration, the host will waste resources trying to explore paths
   that are certain to fail.

5.  Implications for future designs

   We suggest several conclusions from the above results that should be
   relevant to the design of any probing mechanism for exploiting
   alternative paths between two hosts:

   o  The interaction between round-trip time, the transport layer
      acknowledgement mechanism, and the failure detection mechanism is
      quite subtle and significantly affects the time taken to start
      recovery after a failure.

   o  When probing is linked to congestion control, packet loss rates
      may also affect recovery times.

   o  Probe traffic is unlikely to cause overload, especially since
      normal traffic stops during recovery from failure.

   o  Exponential backoff leads to significantly slower recovery time,
      and (due to the previous point) is probably unnecessary.

   o  Probing all alternative paths in parallel leads to significantly
      faster recovery times with only a minor increase in the intensity
      of probe traffic, although this does occur on the paths that are
      still carrying normal traffic. However, full-sized probe packets
      (as used by MPTCP, because they are normal data packets) have
      more impact than short probe packets (as used by SHIM6).

   o  The probe packets should resemble normal data packets as much as
      possible, in order to avoid being treated specially or dropped by
      middleboxes such as firewalls or load balancers.

   o  If Source Address Dependent Routing (SADR) is unavailable, it is
      better to avoid probing address pairs that will fail as a result.
      (Probing all paths in parallel would in fact mask this problem.)

   o  There is little to be gained by having more than two or three
      alternative paths.

6.  Security Considerations

   Apart from the need for SHIM6 to bypass firewalls, no security
   issues were identified during this work.

7.  IANA Considerations

   This document requests no action by IANA.

8.  Acknowledgements

   This document was produced using the xml2rfc tool [RFC2629].

   Some text was adapted from [Naderi14a].

   John Ronan from the Telecommunications Software and Systems Group,
   Waterford Institute of Technology, and the University of Auckland
   Information Technology Services (ITS) helped to run the SHIM6
   experiments over the Internet between Auckland and Dublin.

9.  Change log [RFC Editor: Please remove]

   draft-naderi-ipv6-probing-01: editorial improvements, 2015-04-22.
   draft-naderi-ipv6-probing-00: original version, 2014-10-21.

10.  Informative References

   [Barre08]  Barre, S., "LinShim6 - implementation of the Shim6
              protocol", Technical Report, Universite catholique de
              Louvain, February 2008.

   [Moebius]  Deavours, D., Clark, G., Courtney, T., Daly, D.,
              Derisavi, S., Doyle, J., Sanders, W., and P. Webster,
              "The Moebius framework and its implementation", IEEE
              Transactions on Software Engineering 28(10):956-969,
              October 2002.

   [Naderi10] Naderi, H. and B. Carpenter, "A Performance Study on
              REAchability Protocol in Large Scale IPv6 Networks",
              Second International Conference on Computer and Network
              Technology (ICCNT 2010), Bangkok 28-32, April 2010.

   [Naderi14a]
              Naderi, H., "Evaluating and Improving SHIM6 and MPTCP:
              Two Solutions for IPv6 Multihoming", Ph.D. Thesis, The
              University of Auckland, July 2014.

   [Naderi14b]
              Naderi, H. and B. Carpenter, "Putting SHIM6 into
              Practice", Australasian Telecommunication Networks and
              Applications Conference (ATNAC 2014), Melbourne,
              November 2014.

   [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", RFC 2460, December 1998.

   [RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
              June 1999.

   [RFC2827]  Ferguson, P. and D. Senie, "Network Ingress Filtering:
              Defeating Denial of Service Attacks which employ IP
              Source Address Spoofing", BCP 38, RFC 2827, May 2000.

   [RFC3704]  Baker, F. and P. Savola, "Ingress Filtering for
              Multihomed Networks", BCP 84, RFC 3704, March 2004.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340, March
              2006.

   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol", RFC
              4960, September 2007.

   [RFC5533]  Nordmark, E. and M. Bagnulo, "Shim6: Level 3 Multihoming
              Shim Protocol for IPv6", RFC 5533, June 2009.

   [RFC5534]  Arkko, J. and I. van Beijnum, "Failure Detection and
              Locator Pair Exploration Protocol for IPv6 Multihoming",
              RFC 5534, June 2009.

   [RFC6555]  Wing, D. and A. Yourtchenko, "Happy Eyeballs: Success
              with Dual-Stack Hosts", RFC 6555, April 2012.

   [RFC6824]  Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
              "TCP Extensions for Multipath Operation with Multiple
              Addresses", RFC 6824, January 2013.

   [RFC7045]  Carpenter, B. and S. Jiang, "Transmission and Processing
              of IPv6 Extension Headers", RFC 7045, December 2013.

   [Wischik10]
              Wischik, D., Raiciu, C., and M. Handley, "Balancing
              resource pooling and equipoise in multipath transport",
              8th USENIX Symposium on Networked Systems Design and
              Implementation, San Jose, April 2010.

Authors' Addresses

   Habib Naderi
   Department of Computer Science
   University of Auckland
   PB 92019
   Auckland  1142
   New Zealand

   Email: habib@cs.auckland.ac.nz


   Brian Carpenter (editor)
   Department of Computer Science
   University of Auckland
   PB 92019
   Auckland  1142
   New Zealand

   Email: brian.e.carpenter@gmail.com