ISP Column - January 2009

A Tale of Two Protocols: IPv4, IPv6, MTUs and Fragmentation

I have seen a number of commentaries and presentations in recent times that claim that IPv6 is identical to IPv4 in every respect except one: namely more addresses. Of course that's not just "more" addresses in the sense that 128-bit addresses are "more" than 32-bit addresses: that's 2 to the power 96 times "more" addresses. Here we're talking massively, unimaginably massively, "more" addresses in IPv6!

I must admit to some sympathy for such a claim, given that I find the assertions that IPv6 provides superior QoS capability, better security, improved mobility support or better anything else, as compared to IPv4, to be an expression of largely wishful thinking. There have been some minor tweaks in IPv6 in this respect, but nothing very major. But there is one rather critical difference, and that is the deliberate change in IPv6 with respect to MTU handling and packet fragmentation, and this relatively minor change has some quite critical implications.

In this article I'd like to illustrate some of the implications of this change to the IPv6 treatment of packet fragmentation by taking an in-depth look at IPv6 packet flows, and why and how this change to packet fragmentation management can cause service-level disruption. But first let's start with what triggered my attention to this topic of IPv6 packet fragmentation.

On my laptop running Mac OS X I have two web browsers: the Apple-provided Safari browser, and Firefox. A situation I encountered recently was that I could enter precisely the same URL into both browsers at the same time. The Firefox browser would correctly display the document. The Safari browser would display a status message saying that it was waiting for the server to provide the requested data, and its screen would remain blank indefinitely.
Well maybe not indefinitely, but I ran out of patience after about 2 minutes, given that the other browser was able to display the page immediately. After restarting the browser and trying a few more times it was clear that whatever the problem was, it was not a transient condition, and it could be exercised on demand. What I had encountered was one browser working as expected, while the other did not, with both browsers sitting on the same host, retrieving the same URL, at the same time.

Is this a case of a bug being exercised in Safari? Or some special feature of Firefox that allows the page to be displayed? Or something completely random? Not at all! As we'll see, Safari is working perfectly. In fact every component on my Mac is working correctly. And every component of the remote server is also working correctly. So if there is nothing wrong at either end, then why does one browser correctly display the web page while the other displays nothing, particularly when there appears to be nothing out of the ordinary about my local configuration, the remote server, or the file being retrieved?

I suspect that this kind of behaviour is the very behaviour that keeps sites such as http://www.google.com from operating in dual stack mode, and why, so far, they've put Google's IPv6 support at a different URL, namely http://ipv6.google.com.

The first step along the diagnostic path is to look at the URL being retrieved: http://www.rfc-editor.org/authors/rfc5398.txt. This is a transient URL, and the link will disappear once the document has been published as an RFC. But any large document will do, and if you want to see if you can replicate the problem in your corner of the net, try http://www.rfc-editor.org/rfc-style-guide/rfc-style-manual-08.txt. As we'll see, the only real precondition here is that the document is larger than 1420 bytes.
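Whether a given server name is dual stacked can be checked programmatically as well as with dig. Here's a minimal Python sketch; the function name and the injectable resolver are my own illustration, not anything the browsers themselves use:

```python
import socket

def stacks(host, port=80, resolve=socket.getaddrinfo):
    """Report whether a name resolves to IPv4 and/or IPv6 addresses.

    The resolver is passed in as a parameter so the logic can be
    exercised without touching the network.
    """
    try:
        infos = resolve(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror:
        return (False, False)
    families = {info[0] for info in infos}
    return (socket.AF_INET in families, socket.AF_INET6 in families)

# stacks("www.rfc-editor.org") would report (True, True) for a dual
# stacked server such as the one examined here.
```

This is just the programmatic equivalent of the pair of dig queries shown below: an A lookup and an AAAA lookup against the same name.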
The URL object is a plain text object, so there's nothing special in the retrieved object that would trigger a different response between the two browsers. What about the server itself?

The Environment

Dual Stack Web Server

It appears that the remote server is configured as a dual stack system. A couple of DNS queries can confirm this:

$ dig +short www.rfc-editor.org A
128.9.160.27

$ dig +short www.rfc-editor.org AAAA
2001:1878:400:1:214:4fff:fe67:9351

This server, www.rfc-editor.org, is a dual homed site, supporting both IPv4 and IPv6. This is generally considered the right thing to do if you want to get all those cute little green ticks in the IPv6 readiness tables, so it's reasonable to conclude that the RFC Editor folk are really trying to do the right thing here, which is highly laudable. So what is going wrong with my system when it tries to access this web site using the Safari browser?

Dual Stack Client

My Mac is also a dual stack system.

$ ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
     inet6 fe80::217:f2ff:fec9:1b10%en0 prefixlen 64 scopeid 0x4
     inet6 2001:dc0:2001:10:217:f2ff:fec9:1b10 prefixlen 64 autoconf
     inet 203.10.60.24 netmask 0xffffff00 broadcast 203.10.60.255

In IPv4 the Ethernet interface of my system has the address 203.10.60.24, and in IPv6 this interface uses the address 2001:dc0:2001:10:217:f2ff:fec9:1b10.

Dual Stack Web Browser

The typical behaviour of a web browser on a dual-stack host is to first perform both an IPv4 and an IPv6 address query for the remote site. If an IPv6 address record is returned, then the browser will first attempt an IPv6 connection to the server, and will fall back to IPv4 if the IPv6 connection attempts fail. If the initial TCP SYN packet exchange succeeds then the association of the AAAA record (and implicitly the use of IPv6 as the connection protocol) is locally cached with the server name, and repeated connections to the same remote server name will invoke an IPv6 TCP connection without any delay.
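The preference logic just described can be sketched in a few lines of Python. This is a deliberately simplified model, with function names of my own choosing; real browser and OS stacks use considerably more involved retry timers than a single per-address timeout:

```python
import socket

def prefer_ipv6(addrinfos):
    """Order candidate addresses IPv6-first, the way a dual-stack
    browser prefers an AAAA record over an A record."""
    v6 = [ai for ai in addrinfos if ai[0] == socket.AF_INET6]
    v4 = [ai for ai in addrinfos if ai[0] == socket.AF_INET]
    return v6 + v4

def connect_dual_stack(host, port, timeout=5.0):
    """Attempt a TCP connection IPv6-first, falling back to the next
    candidate address when an attempt fails or times out."""
    last_error = None
    for family, socktype, proto, _, sockaddr in prefer_ipv6(
            socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)):
        s = socket.socket(family, socktype, proto)
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
            return s   # s.family tells the caller which protocol won
        except OSError as err:
            last_error = err
            s.close()
    raise last_error or OSError("no usable addresses")
```

The key point for what follows is the sequential nature of this logic: IPv4 is not tried until every IPv6 attempt has been given up on, which is where the long fallback delays examined below come from.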
To illustrate this behaviour, here's a packet dump of a Safari connection request to http://www.ripe.net, where the first set of packets are the DNS queries and responses.

** concurrent DNS queries for A and AAAA resource records
09:51:52.878779 IP (tos 0x0, ttl 64, id 1595, offset 0, flags [none], proto UDP (17), length 58)
    dhcp24.potaroo.net.55123 > uneeda.telstra.net.domain: [udp sum ok] 31014+ A? www.ripe.net. (30)

These are tcpdump reports. The dump shows the time that the packet was captured, followed by some details of the packet.

09:51:52.878779
    This packet was captured at 9:51 am.

IP
    The protocol type, in this case IPv4.

(tos 0x0, ttl 64, id 1595, offset 0, flags [none], proto UDP (17), length 58)
    The IP packet header. The TOS field value is zero. The TTL field value is 64, and the packet identification value is 1595. The packet has not been fragmented, as the offset value is zero and the More Fragments flag is clear. Fragmentation is permitted, as the Don't Fragment flag is clear. The packet is a UDP packet.

dhcp24.potaroo.net.55123 > uneeda.telstra.net.domain:
    The IP packet is addressed to 139.130.4.4 (uneeda.telstra.net) from 203.10.60.24 (dhcp24.potaroo.net). The destination UDP port is 53 ("domain"), and the source port is 55123.

[udp sum ok]
    The UDP pseudo header checksum is ok.

31014+ A? www.ripe.net.
    The query has the identifier value 31014, and is a DNS query for the A Resource Record for the DNS name "www.ripe.net".

09:51:52.879015 IP (tos 0x0, ttl 64, id 24361, offset 0, flags [none], proto UDP (17), length 58)
    dhcp24.potaroo.net.60459 > uneeda.telstra.net.domain: [udp sum ok] 55689+ AAAA? www.ripe.net. (30)

** DNS responses for A and AAAA resource records
09:51:53.058304 IP (tos 0x0, ttl 59, id 27655, offset 0, flags [none], proto UDP (17), length 99)
    uneeda.telstra.net.domain > dhcp24.potaroo.net.55123: [udp sum ok] 31014 q: A? www.ripe.net. 2/0/0 www.ripe.net. CNAME aquila-www.ripe.net., aquila-www.ripe.net.
A 193.0.19.25 (71) 09:51:53.061607 IP (tos 0x0, ttl 59, id 27663, offset 0, flags [none], proto UDP (17), length 111) uneeda.telstra.net.domain > dhcp24.potaroo.net.60459: [udp sum ok] 55689 q: AAAA? www.ripe.net. 2/0/0 www.ripe.net. CNAME aquila-www.ripe.net., aquila-www.ripe.net. AAAA 2001:610:240:11::c100:1319 (83) The packet dump of the initial DNS query shows that the browser learns both IPv6 and IPv4 addresses for the host name www.ripe.net. The actual server is aquila-www.ripe.net, and the server's IPv4 address is 193.0.19.25, and its IPv6 address is 2001:610:240:11::c100:1319. The browser appears to wait for both DNS queries to complete before proceeding. As there is an IPv6 address for this server, the browser will then attempt a connection request using TCP over IPv6, and will send an initial SYN packet. Here's the packet dump of the initial TCP handshake. ** IPv6 TCP connection handshake 09:51:53.063157 IP6 (hlim 64, next-header: TCP (6), length: 44) 2001:dc0:2001:10:217:f2ff:fec9:1b10.50680 > aquila.ripe.net.http: S, cksum 0x1a63 (correct), 2958875177:2958875177(0) win 65535 09:51:53.454359 IP6 (hlim 53, next-header: TCP (6), length: 40) aquila.ripe.net.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.50680: S, cksum 0xc9e1 (correct), 2390113939:2390113939(0) ack 2958875178 win 5712 09:51:53.454476 IP6 (hlim 64, next-header: TCP (6), length: 32) 2001:dc0:2001:10:217:f2ff:fec9:1b10.50680 > aquila.ripe.net.http: ., cksum 0x8e36 (correct), 1:1(0) ack 1 win 32803 If the TCP three-way handshake completes, then the browser will "lock on" to using IPv6 for this server name, and will then send the HTTP request, and the server will send its response. 
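Before looking at the dump of the request itself, note that this "lock on" behaviour amounts to a tiny per-server-name cache. A toy sketch of the idea, with names of my own invention rather than Safari's actual internals:

```python
# Toy model of the browser's per-server-name protocol cache: once an
# IPv6 handshake has succeeded for a name, subsequent connections to
# that name use IPv6 immediately, with no fallback delay.
_preferred_family = {}

def record_success(server_name, family):
    """Remember which protocol family completed the handshake."""
    _preferred_family[server_name] = family

def preferred_family(server_name, default="try IPv6 first"):
    """Look up the cached protocol choice for a server name."""
    return _preferred_family.get(server_name, default)
```

The cache is what makes the second and subsequent fetches from a dual stack server fast: the fallback logic only runs on the first contact with a given name.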
** HTTP request and document response, over IPv6
09:51:53.455089 IP6 (hlim 64, next-header: TCP (6), length: 400)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.50680 > aquila.ripe.net.http: P, cksum 0x6093 (correct), 1:369(368) ack 1 win 32803
GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_5; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.ripe.net

** Server's ACK of the request
09:51:53.865482 IP6 (hlim 53, next-header: TCP (6), length: 32)
    aquila.ripe.net.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.50680: ., cksum 0x0b19 (correct), 1:1(0) ack 369 win 53

** Server commences download of the requested document
09:51:53.890642 IP6 (hlim 53, next-header: TCP (6), length: 1320)
    aquila.ripe.net.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.50680: ., cksum 0x41e3 (correct), 1:1289(1288) ack 369 win 53
HTTP/1.1 200 OK
Date: Wed, 10 Dec 2008 22:51:53 GMT
Server: Apache/2
Accept-Ranges: bytes
Keep-Alive: timeout=8, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
[data download...]

If there is no AAAA record for the server name, then the browser will use the IPv4 address and attempt the connection using TCP over IPv4. Here's a connection trace to the IPv4-only server www.google.com:

** concurrent DNS queries for A and AAAA resource records
09:52:07.661336 IP (tos 0x0, ttl 64, id 46622, offset 0, flags [none], proto UDP (17), length 60)
    dhcp24.potaroo.net.63484 > uneeda.telstra.net.domain: [udp sum ok] 13014+ A? www.google.com. (32)
09:52:07.661434 IP (tos 0x0, ttl 64, id 47662, offset 0, flags [none], proto UDP (17), length 60)
    dhcp24.potaroo.net.53534 > uneeda.telstra.net.domain: [udp sum ok] 20578+ AAAA? www.google.com.
(32) ** DNS responses for A resource record and no AAAA resource record 09:52:07.746621 IP (tos 0x0, ttl 59, id 5416, offset 0, flags [none], proto UDP (17), length 144) uneeda.telstra.net.domain > dhcp24.potaroo.net.63484: [udp sum ok] 13014 q: A? www.google.com. 5/0/0 www.google.com. CNAME www.l.google.com., www.l.google.com. A 209.85.173.99, www.l.google.com. A 209.85.173.103, www.l.google.com. A 209.85.173.104, www.l.google.com. A 209.85.173.147 (116) 09:52:07.750023 IP (tos 0x0, ttl 59, id 5426, offset 0, flags [none], proto UDP (17), length 128) uneeda.telstra.net.domain > dhcp24.potaroo.net.53534: [udp sum ok] 20578 q: AAAA? www.google.com. 1/1/0 www.google.com. CNAME www.l.google.com. ns: l.google.com. SOA f.l.google.com. dns-admin.google.com. 1365498 900 900 1800 60 (100) ** IPv4 TCP connection handshake 09:52:07.751608 IP (tos 0x0, ttl 64, id 7987, offset 0, flags [DF], proto TCP (6), length 64) dhcp24.potaroo.net.50684 > mh-in-f99.google.com.http: S, cksum 0x6127 (correct), 699722949:699722949(0) win 65535 09:52:07.945646 IP (tos 0x0, ttl 48, id 56566, offset 0, flags [none], proto TCP (6), length 60) mh-in-f99.google.com.http > dhcp24.potaroo.net.50684: S, cksum 0xc284 (correct), 1111088365:1111088365(0) ack 699722950 win 5672 09:52:07.945760 IP (tos 0x0, ttl 64, id 39278, offset 0, flags [DF], proto TCP (6), length 52) dhcp24.potaroo.net.50684 > mh-in-f99.google.com.http: ., cksum 0x0758 (correct), 1:1(0) ack 1 win 65535 ** HTTP request and document response, over IPv4 09:52:07.946356 IP (tos 0x0, ttl 64, id 1629, offset 0, flags [DF], proto TCP (6), length 898) dhcp24.potaroo.net.50684 > mh-in-f99.google.com.http: P, cksum 0x185b (correct), 1:847(846) ack 1 win 65535 GET / HTTP/1.1 User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_5; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 Accept: text/xml,application/xml,application/xhtml+xml,text/html; q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 Accept-Language: en-us 
Accept-Encoding: gzip, deflate Cookie: rememberme=true; __utmx=173272373.; __utmxx=173272373.; SID=DQAAAHgAAACE_3ogvcIiR6ultKmbZ5_Y-AkYVLkGp55EL1-4cUqvHEmh8 Kjm5oUWqWe_fST6hXBA57WrIpr-seELG7gSIuK3LNYbCynJ_RIat_weZZn5Df LuG8CvQ3iLQJYdLKTd3Ak5dZEkrKqtqdpskrCnQD1pcVFO6k1tfgN7euV3yWP NQA; NID=17=Y-iTGpiaWGhSMMeczPXYLvN7mnOgNSD6RniNpe-IFvR1RjqgP z1_ApuLJ2aJEl973kIrcS3MkbeoZPqRoaCIaVgs5VbbPRjnW7m0OMm3YUTKm2 RZEUYW_lwLi5h7pz9K; TZ=-660; PREF=ID=707b2eb5047c4408:TM=1226 974755:LM=1228161958:GM=1:S=VLSk7abZNZuXVtHX Connection: keep-alive Host: www.google.com 09:52:08.170216 IP (tos 0x0, ttl 48, id 56567, offset 0, flags [none], proto TCP (6), length 52) mh-in-f99.google.com.http > dhcp24.potaroo.net.50684: ., cksum 0x02b5 (correct), 1:1(0) ack 847 win 116 09:52:08.177759 IP (tos 0x0, ttl 48, id 56568, offset 0, flags [none], proto TCP (6), length 645) mh-in-f99.google.com.http > dhcp24.potaroo.net.50684: P, cksum 0xee85 (correct), 1:594(593) ack 847 win 116 HTTP/1.1 302 Found Location: http://www.google.com.au/ Cache-Control: private Content-Type: text/html; charset=UTF-8 Set-Cookie: __utmx=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com Set-Cookie: __utmxx=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com Date: Wed, 10 Dec 2008 22:52:08 GMT Server: gws Content-Length: 222 [data download...] The other connection case is where there is both IPv4 and IPv6 addresses and the IPv6 address does not respond. To look at the way in which various systems behave I've set up a domain name, impossible.rand.apnic.net that has both IPv6 and IPv4 addresses. The IPv6 address is 5000::1, which is, deliberately, an unconnected network. The IPv4 address points to a valid web server at 203.119.0.116. 
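The traces that follow show each host retransmitting its TCP SYN on a roughly exponential backoff schedule before abandoning the IPv6 address. As a toy model of that schedule (my own simplification: real stacks add fixed floors, caps and a final wait on the last timeout):

```python
def syn_schedule(initial_rto, syn_count):
    """Send times (in seconds) of successive SYNs when the TCP
    retransmission timeout doubles after each attempt."""
    times, t, rto = [], 0.0, float(initial_rto)
    for _ in range(syn_count):
        times.append(t)
        t += rto
        rto *= 2.0
    return times

# Windows XP sends 3 SYNs with a 3 second initial timeout:
# syn_schedule(3, 3) -> [0.0, 3.0, 9.0], and the stack gives up once
# the final 12 second timeout expires, for about 21 seconds of
# waiting, close to the 23 seconds observed in the trace below.
```

The Mac OS X trace below does not follow pure doubling: it front-loads several one-second retransmits before the timeout starts doubling, which is how it arrives at 11 SYNs and a 75 second total before falling back to IPv4.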
The behaviours of various browsers on various host OS platforms are shown here:

Safari on Mac OS X

** IPV6 TCP SYN #1
14:23:03.837528 IP6 (hlim 64, next-header: TCP (6), length: 44)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x70c0 (correct), 3375473401:3375473401(0) win 65535

** +1 seconds: IPv6 TCP SYN #2
14:23:04.788755 IP6 (hlim 64, next-header: TCP (6), length: 44)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x70b7 (correct), 3375473401:3375473401(0) win 65535

** +1 seconds: IPv6 TCP SYN #3
14:23:05.789683 IP6 (hlim 64, next-header: TCP (6), length: 44)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x70ad (correct), 3375473401:3375473401(0) win 65535

** +1 seconds: IPv6 TCP SYN #4 - drop TCP options
14:23:06.790876 IP6 (hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26 (correct), 3375473401:3375473401(0) win 65535

** +1 seconds: IPv6 TCP SYN #5
14:23:07.792074 IP6 (hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26 (correct), 3375473401:3375473401(0) win 65535

** +1 seconds: IPv6 TCP SYN #6
14:23:08.793307 IP6 (hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26 (correct), 3375473401:3375473401(0) win 65535

** +2 seconds: IPv6 TCP SYN #7
14:23:10.795789 IP6 (hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26 (correct), 3375473401:3375473401(0) win 65535

** +4 seconds: IPv6 TCP SYN #8
14:23:14.800180 IP6 (hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26 (correct), 3375473401:3375473401(0) win 65535

** +8 seconds: IPv6 TCP SYN #9
14:23:22.810059 IP6 (hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26
(correct), 3375473401:3375473401(0) win 65535 ** +16 seconds: IPv6 TCP SYN #10 14:23:38.829149 IP6 (hlim 64, next-header: TCP (6), length: 28) 2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26 (correct), 3375473401:3375473401(0) win 65535 ** +32.0 seconds: IPv6 TCP SYN #11 14:24:10.868094 IP6 (hlim 64, next-header: TCP (6), length: 28) 2001:dc0:2001:10:217:f2ff:fec9:1b10.52942 > 5000::1.http: S, cksum 0x9c26 (correct), 3375473401:3375473401(0) win 65535 ** After 75 seconds and 11 IPv6 connection attempts, Safari now attempts an IPv4 connection 14:24:18.402774 IP (tos 0x0, ttl 64, id 3691, offset 0, flags [DF], proto TCP (6), length 64) dhcp24.potaroo.net.52946 > wattle.apnic.net.http: S, cksum 0x9348 (correct), 1099161300:1099161300(0) win 65535 ** remote system TCP response 14:24:18.435474 IP (tos 0x0, ttl 56, id 7273, offset 0, flags [DF], proto TCP (6), length 64) wattle.apnic.net.http > dhcp24.potaroo.net.52946: S, cksum 0x0a60 (correct), 3703888493:3703888493(0) ack 1099161301 win 65535 ** Completion of TCP 3 way handshake 14:24:18.435571 IP (tos 0x0, ttl 64, id 17232, offset 0, flags [DF], proto TCP (6), length 52) dhcp24.potaroo.net.52946 > wattle.apnic.net.http: ., cksum 0x49cc (correct), 1:1(0) ack 1 win 65535 ** Client to Server: HTTP Get command 14:24:18.436202 IP (tos 0x0, ttl 64, id 64549, offset 0, flags [DF], proto TCP (6), length 586) dhcp24.potaroo.net.52946 > wattle.apnic.net.http: P, cksum 0x5925 (correct), 1:535(534) ack 1 win 65535 ** Server to Client: ACK 14:24:18.470446 IP (tos 0x0, ttl 56, id 7276, offset 0, flags [DF], proto TCP (6), length 52) wattle.apnic.net.http > dhcp24.potaroo.net.52946: ., cksum 0x852f (correct), 1:1(0) ack 1 win 50297 ** Server to Client: First data packet 14:24:18.506189 IP (tos 0x0, ttl 56, id 7277, offset 0, flags [DF], proto TCP (6), length 1400) wattle.apnic.net.http > dhcp24.potaroo.net.52946: ., cksum 0x83ec (correct), 1:1349(1348) ack 535 win 50297 In this case Safari took 75 seconds 
and 11 IPv6 connection attempts before flipping over to IPv4.

Internet Explorer on Windows XP

** IPV6 TCP SYN #1
15:27:13.294675 IP6 (hlim 64, next-header: TCP (6), length: 24)
    2001:DC0:2001:10:7151:ADF3:D2EA:6018.1030 > 5000::1.80: S, cksum 0xde0c (correct), 3602820861:3602820861(0) win 16384

** +3 seconds: IPv6 TCP SYN #2
15:27:16.566773 IP6 (hlim 64, next-header: TCP (6), length: 24)
    2001:DC0:2001:10:7151:ADF3:D2EA:6018.1030 > 5000::1.80: S, cksum 0xde0c (correct), 3602820861:3602820861(0) win 16384

** +6 seconds: IPv6 TCP SYN #3
15:27:23.129986 IP6 (hlim 64, next-header: TCP (6), length: 24)
    2001:DC0:2001:10:7151:ADF3:D2EA:6018.1030 > 5000::1.80: S, cksum 0xde0c (correct), 3602820861:3602820861(0) win 16384

** After 23 seconds and 3 IPv6 connection attempts, Explorer now attempts an IPv4 connection
15:27:36.255871 IP (tos 0x0, ttl 128, id 1537, offset 0, flags [DF], proto: TCP (6), length: 48)
    xp12.telstra.net.1093 > wattle.apnic.net.80: S, cksum 0x31d9 (correct), 3032925425:3032925425(0) win 65535

** remote system TCP response
15:27:36.291706 IP (tos 0x0, ttl 56, id 8329, offset 0, flags [DF], proto: TCP (6), length: 48)
    wattle.apnic.net.80 > xp12.telstra.net.1093: S, cksum 0x5600 (correct), 1260360205:1260360205(0) ack 3032925426 win 65535

** Completion of TCP 3 way handshake
15:27:36.291799 IP (tos 0x0, ttl 128, id 1539, offset 0, flags [DF], proto: TCP (6), length: 40)
    xp12.telstra.net.1093 > wattle.apnic.net.80: ., cksum 0x815f (correct), 1:1(0) ack 1 win 65535

** Client to Server: HTTP Get Command
15:27:36.292347 IP (tos 0x0, ttl 128, id 1540, offset 0, flags [DF], proto: TCP (6), length: 481)
    xp12.telstra.net.1093 > wattle.apnic.net.80: P, cksum 0xd6e2 (correct), 1:442(441) ack 1 win 65535

** Server to Client: First data packet
15:27:36.357766 IP (tos 0x0, ttl 56, id 8331, offset 0, flags [DF], proto: TCP (6), length: 1400)
    wattle.apnic.net.80 > xp12.telstra.net.1093: ., cksum 0x68a4 (correct), 1:1361(1360) ack 442 win 65535

In this case Explorer took 23
seconds and 3 IPv6 connection attempts before flipping over to IPv4.

Firefox on FreeBSD

** IPV6 TCP SYN #1
15:03:15.510137 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 40)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** +3 seconds: IPv6 TCP SYN #2
15:03:18.509278 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 40)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** +3 seconds: IPv6 TCP SYN #3
15:03:21.709264 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 40)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** +6 seconds: IPv6 TCP SYN #4 - note the change of TCP options
15:03:28.109243 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** +3 seconds: IPv6 TCP SYN #5
15:03:31.309229 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** +6 seconds: IPv6 TCP SYN #6
15:03:37.509202 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** +12 seconds: IPv6 TCP SYN #7
15:03:49.709159 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** +24 seconds: IPv6 TCP SYN #8
15:04:13.909056 IP6 (flowlabel 0x92322, hlim 64, next-header: TCP (6), length: 28)
    2001:dc0:2001:10:20e:7fff:feac:d687.53162 > 5000::1.http: S, 284152944:284152944(0) win 65535

** After 75 seconds and 8 IPv6 connection attempts, Firefox now attempts an IPv4 connection
15:04:30.509305 IP (tos 0x0, ttl 64, id 4878, offset 0, flags [DF], proto: TCP (6), length: 60)
    workdog.potaroo.net.64421 >
wattle.apnic.net.http: S, 3881266848:3881266848(0) win 65535

** remote system TCP response
15:04:30.542713 IP (tos 0x0, ttl 56, id 47536, offset 0, flags [DF], proto: TCP (6), length: 64)
    wattle.apnic.net.http > workdog.potaroo.net.64421: S, 597175425:597175425(0) ack 3881266849 win 65535

** Completion of TCP 3 way handshake
15:04:30.542788 IP (tos 0x0, ttl 64, id 4879, offset 0, flags [DF], proto: TCP (6), length: 52)
    workdog.potaroo.net.64421 > wattle.apnic.net.http: ., 1:1(0) ack 1 win 8256

** Client to Server: HTTP Get Command
15:04:30.542935 IP (tos 0x0, ttl 64, id 4880, offset 0, flags [DF], proto: TCP (6), length: 465)
    workdog.potaroo.net.64421 > wattle.apnic.net.http: P, 1:414(413) ack 1 win 8256

** Server to Client: ACK
15:04:30.579270 IP (tos 0x0, ttl 56, id 47537, offset 0, flags [DF], proto: TCP (6), length: 52)
    wattle.apnic.net.http > workdog.potaroo.net.64421: ., 1:1(0) ack 1 win 50297

** Server to Client: First data packet
15:04:30.606468 IP (tos 0x0, ttl 56, id 47539, offset 0, flags [DF], proto: TCP (6), length: 1400)
    wattle.apnic.net.http > workdog.potaroo.net.64421: ., 1:1349(1348) ack 414 win 50297

In this case Firefox took 75 seconds and 8 IPv6 connection attempts before flipping over to IPv4.

The Observed Problem

Back to the problem. What I was seeing on my system was that Firefox would reliably display the document, yet, no matter how I tried, I could not get Safari to display the same document.
All Safari displayed was a blank screen and a status message saying that Safari was "connecting".

Firstly, here's a dump of the Firefox packets, showing a working connection:

** IPv4 TCP connection handshake
09:14:13.807595 IP (tos 0x0, ttl 64, id 884, offset 0, flags [DF], proto TCP (6), length 64)
    dhcp20.potaroo.net.49971 > www.rfc-editor.org.http: S, cksum 0x7a82 (correct), 707719249:707719249(0) win 65535
09:14:14.022077 IP (tos 0x0, ttl 49, id 24131, offset 0, flags [DF], proto TCP (6), length 64)
    www.rfc-editor.org.http > dhcp20.potaroo.net.49971: S, cksum 0x388f (correct), 3785256116:3785256116(0) ack 707719250 win 49232
09:14:14.022191 IP (tos 0x0, ttl 64, id 24364, offset 0, flags [DF], proto TCP (6), length 52)
    dhcp20.potaroo.net.49971 > www.rfc-editor.org.http: ., cksum 0x39a9 (correct), 1:1(0) ack 1 win 65535

** Client to Server: HTTP GET request for document
09:14:14.022318 IP (tos 0x0, ttl 64, id 19744, offset 0, flags [DF], proto TCP (6), length 565)
    dhcp20.potaroo.net.49971 > www.rfc-editor.org.http: P, cksum 0xe0ff (correct), 1:514(513) ack 1 win 65535
GET /authors/rfc5398.txt HTTP/1.1
Host: www.rfc-editor.org
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
If-Modified-Since: Fri, 05 Dec 2008 23:42:40 GMT
If-None-Match: "962f-1544-3b93fc00"
Cache-Control: max-age=0

** Server to Client: ACK of the HTTP GET Request
09:14:14.245804 IP (tos 0x0, ttl 49, id 24132, offset 0, flags [DF], proto TCP (6), length 52)
    www.rfc-editor.org.http > dhcp20.potaroo.net.49971: ., cksum 0x7942 (correct), 1:1(0) ack 514 win 48719

** Server to Client: HTTP Server OK response
09:14:14.274626 IP (tos 0x0, ttl 49, id 24133, offset 0, flags [DF], proto TCP (6), length 372)
    www.rfc-editor.org.http >
dhcp20.potaroo.net.49971: P, cksum 0x04ac (correct), 1:321(320) ack 514 win 48719 HTTP/1.1 200 OK Date: Mon, 08 Dec 2008 22:14:14 GMT Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.8e DAV/2 Last-Modified: Mon, 08 Dec 2008 22:10:23 GMT ETag: "962f-14fd-4b1231c0" Accept-Ranges: bytes Content-Length: 5373 Keep-Alive: timeout=5, max=100 Connection: Keep-Alive Content-Type: text/plain ** Client to Server: ACK of first packet 09:14:14.274728 IP (tos 0x0, ttl 64, id 36637, offset 0, flags [DF], proto TCP (6), length 52) dhcp20.potaroo.net.49971 > www.rfc-editor.org.http: ., cksum 0x364e (correct), 514:514(0) ack 321 win 65535 ** Server to Client: First of four back-to-back packets (sender is using an initial window size of 4) 09:14:14.284709 IP (tos 0x0, ttl 49, id 24134, offset 0, flags [DF], proto TCP (6), length 1500) www.rfc-editor.org.http > dhcp20.potaroo.net.49971: ., cksum 0x7b84 (correct), 321:1769(1448) ack 514 win 48719 [data] ** Server to Client: Second data packet 09:14:14.293156 IP (tos 0x0, ttl 49, id 24135, offset 0, flags [DF], proto TCP (6), length 1500) www.rfc-editor.org.http > dhcp20.potaroo.net.49971: P, cksum 0xd372 (correct), 1769:3217(1448) ack 514 win 48719 [data] ** Client to Server: ACK generated for the first two packets 09:14:14.293227 IP (tos 0x0, ttl 64, id 23065, offset 0, flags [DF], proto TCP (6), length 52) dhcp20.potaroo.net.49971 > www.rfc-editor.org.http: ., cksum 0x2b0b (correct), 514:514(0) ack 3217 win 65522 ** Server to Client: Third data packet 09:14:14.497735 IP (tos 0x0, ttl 49, id 24136, offset 0, flags [DF], proto TCP (6), length 1500) www.rfc-editor.org.http > dhcp20.potaroo.net.49971: ., cksum 0x8d33 (correct), 3217:4665(1448) ack 514 win 48719 [data] ** Server to Client: final data packet 09:14:14.503974 IP (tos 0x0, ttl 49, id 24137, offset 0, flags [DF], proto TCP (6), length 1081) www.rfc-editor.org.http > dhcp20.potaroo.net.49971: P, cksum 0x2819 (correct), 4665:5694(1029) ack 514 win 48719 [data] ** Client to 
Server: ACK of data
09:14:14.504066 IP (tos 0x0, ttl 64, id 48896, offset 0, flags [DF], proto TCP (6), length 52)
    dhcp20.potaroo.net.49971 > www.rfc-editor.org.http: ., cksum 0x2137 (correct), 514:514(0) ack 5694 win 65535

Clearly, Firefox is using IPv4 for the retrieval. This is not surprising, as when I look at about:config in the Firefox browser, I can see that the local variable network.dns.disableIPv6 is set to True.

The document is 5373 bytes long, and is downloaded in its entirety in four IPv4 packets, three of 1500 bytes and the final packet of 1081 bytes. Three of these packets have a payload of 1448 bytes, coupled with a 20 byte TCP header, a 10 byte TCP timestamp option header, a 2 byte padding field and a 20 byte IPv4 packet header, while the fourth packet contains the remaining 1029 bytes of the document. The IPv4 packet headers have the DF fragment bit set, preventing fragmentation of the packet in transit, as the Apache server uses Path MTU discovery and explicitly prevents intermediate systems performing IPv4 packet fragmentation.

What happens when Safari is used to download the same page? Safari is operating in dual stack mode and the DNS query has resulted in both AAAA and A records, so, as described above, Safari will attempt to undertake the page retrieval using IPv6. Here's the packet dump.

** Start TCP session with three-way handshake.
09:14:47.239483 IP6 (hlim 64, next-header: TCP (6), length: 44) 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978 > www.rfc-editor.org.http: S, cksum 0x3306 (correct), 1585949805:1585949805(0) win 65535 < mss 1440,nop,wscale 5,nop,nop,timestamp 372220498 0,sackOK,eol> 09:14:47.523248 IP6 (hlim 53, next-header: TCP (6), length: 44) www.rfc-editor.org.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978: S, cksum 0x8667 (correct), 994602419:994602419(0) ack 1585949806 win 49980 09:14:47.523340 IP6 (hlim 64, next-header: TCP (6), length: 32) 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978 > www.rfc-editor.org.http: ., cksum 0x0a39 (correct), 1:1(0) ack 1 win 32799 ** Client to Server: HTTP GET request for document 09:14:47.523780 IP6 (hlim 64, next-header: TCP (6), length: 451) 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978 > www.rfc-editor.org.http: P, cksum 0x7b3b (correct), 1:420(419) ack 1 win 32799 GET /authors/rfc5398.txt HTTP/1.1 User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_5; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 Cache-Control: max-age=0 Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9, text/plain;q=0.8,image/png,*/*;q=0.5 Accept-Language: en-us Accept-Encoding: gzip, deflate Connection: keep-alive Host: www.rfc-editor.org ** Server to Client: ACK of the HTTP GET Request 09:14:47.825010 IP6 (hlim 53, next-header: TCP (6), length: 32) www.rfc-editor.org.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978: ., cksum 0xc6ff (correct), 1:1(0) ack 420 win 49561 ** Server to Client: HTTP Server OK response 09:14:47.825813 IP6 (hlim 53, next-header: TCP (6), length: 352) www.rfc-editor.org.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978: P, cksum 0x4f67 (correct), 1:321(320) ack 420 win 49561 HTTP/1.1 200 OK Date: Mon, 08 Dec 2008 22:14:47 GMT Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.8e DAV/2 Last-Modified: Mon, 08 Dec 2008 22:10:23 GMT ETag: "962f-14fd-4b1231c0" Accept-Ranges: bytes Content-Length: 5373 
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/plain

** Client to Server: ACK of the first part of the response
09:14:47.825898 IP6 (hlim 64, next-header: TCP (6), length: 32)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.49978 > www.rfc-editor.org.http: ., cksum 0x0741 (correct), 420:420(0) ack 321 win 32789

** [The Server sends 4 back-to-back data packets as part of its TCP initial window. The Server's 3 full-size packets are dropped within the network - note that the stream offset sequence numbers seen at the client end jump from 321 to 4605, indicating a drop of three TCP packets, each carrying 1428 bytes of TCP payload]

** Server to Client: last part of the document
09:14:48.102778 IP6 (hlim 53, next-header: TCP (6), length: 1121)
    www.rfc-editor.org.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978: P, cksum 0x65b0 (correct), 4605:5694(1089) ack 420 win 49561 [data]

** Client to Server: ACK of previous good data - signalling the data drop
09:14:48.102877 IP6 (hlim 64, next-header: TCP (6), length: 44)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.49978 > www.rfc-editor.org.http: ., cksum 0x56eb (correct), 420:420(0) ack 321 win 32799

** [Server attempts to resend the data using a single full sized data packet - which gets dropped within the network]

** Server to client: data transmission has finished - send a FIN TCP packet
09:14:52.820267 IP6 (hlim 53, next-header: TCP (6), length: 32)
    www.rfc-editor.org.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.49978: F, cksum 0xaec7 (correct), 5694:5694(0) ack 420 win 49561

** Client to Server: ACK of previous good data - signalling the data drop
09:14:52.820373 IP6 (hlim 64, next-header: TCP (6), length: 44)
    2001:dc0:2001:10:217:f2ff:fec9:1b10.49978 > www.rfc-editor.org.http: ., cksum 0x56bb (correct), 420:420(0) ack 321 win 32799

** [Server attempts to resend the data using a restart, attempting to send the first full sized data packet - which also gets dropped within the network]

** [Client aborts the connection]

What the Safari
browser shows is a blank screen, with the status text: "Contacting www.rfc-editor.org". The browser appears to be hung. What is confusing to the user is that while the browser will not respond to this particular URL, it will respond normally to requests for non-existent documents, displaying the correct error page. Here's a dump of a request to the same server for a non-existent document.

** Client to Server: HTTP GET request of a non-existent document
09:15:04.376570 IP6 (hlim 64, next-header: TCP (6), length: 425) 2001:dc0:2001:10:217:f2ff:fec9:1b10.49980 > www.rfc-editor.org.http: P, cksum 0x31f1 (correct), 394:787(393) ack 467 win 32799
GET /authors/rfc5399.txt HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_5; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.rfc-editor.org

** Server to Client: ACK of the HTTP GET Request
09:15:04.652396 IP6 (hlim 53, next-header: TCP (6), length: 32) www.rfc-editor.org.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.49980: ., cksum 0x8067 (correct), 467:467(0) ack 787 win 49194

** Server to Client: Response: No such Document
09:15:04.668977 IP6 (hlim 53, next-header: TCP (6), length: 497) www.rfc-editor.org.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.49980: P, cksum 0xd42e (correct), 467:932(465) ack 787 win 49194
HTTP/1.1 404 Not Found
Date: Mon, 08 Dec 2008 22:15:04 GMT
Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.8e DAV/2
Content-Length: 217
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

404 Not Found

Not Found

The requested URL /authors/rfc5399.txt was not found on this server.

** Client to Server: ACK of the Server Response
09:15:04.669083 IP6 (hlim 64, next-header: TCP (6), length: 32) 2001:dc0:2001:10:217:f2ff:fec9:1b10.49980 > www.rfc-editor.org.http: ., cksum 0xbead (correct), 787:787(0) ack 932 win 32784

Diagnosis

The browser using IPv4 appears to be working without any issues at all, while the IPv6 connection is unable to retrieve some documents. One possible theory could be that there is some middleware or filter out there that is knocking out the RFC Editor's documents. While some RFCs are a little out of date, and others are of dubious value, I'm pretty sure that even the most incompetent of content filters wouldn't find them offensive! So is it my browser, my operating system or something to do with my configuration? It does not appear to be so. Other dual-stack web sites, such as http://www.potaroo.net and http://www.apnic.net, work just fine from my local system, using both Firefox and Safari over both IPv4 and IPv6. And IPv6-only sites, such as ipv6.google.com, also appear to work just fine. So is it something about the rfc-editor's web server? I can test this by attempting to connect from a different location. And when I do that, there is no problem at all: when I connect my Mac to the network at another location, both browsers work just fine. So in looking at why this is occurring, it's likely that it has nothing to do with local configuration, the content of the document, the HTTP protocol or the behaviour of TCP. The major difference is the difference in protocols. A likely hypothesis is that the problem lies somewhere in the IPv6 path between the rfc-editor's web server and my location. As always in the case of network problems, there are two trusty tools: traceroute and ping, or in the case of IPv6, traceroute6 and ping6. So let's use traceroute6 to check whether there is an IPv6 network path between my system and the remote web server.
$ traceroute6 2001:1878:400:1:214:4fff:fe67:9351
traceroute6 to 2001:1878:400:1:214:4fff:fe67:9351 (2001:1878:400:1:214:4fff:fe67:9351) from 2001:dc0:2001:10:217:f2ff:fec9:1b10, 30 hops max, 12 byte packets
 1  2001:dc0:2001:10:2b0:c2ff:fe8d:c034  2.187 ms  2.125 ms  2.108 ms
 2  2001:dc0:2001:249::1  35.696 ms  36.683 ms  38.184 ms
 3  2001:dc0:2001:255::2  166.429 ms  167.06 ms  168.776 ms
 4  2001:200:0:fe00::9c1:1  168.784 ms  169.741 ms  169.496 ms
 5  2001:240:bb01:e::7d  164.829 ms  2001:240:bb01:f::7e  170.876 ms  168.719 ms
 6  tky009bf01.IIJ.Net  165.025 ms  tky001bf01.IIJ.Net  166.906 ms  tky009bf00.IIJ.Net  177.922 ms
 7  tky001ix02.IIJ.Net  192.543 ms  166.019 ms  174.734 ms
 8  2001:7fa:7:1::2914:1  267.687 ms  285.419 ms  268.23 ms
 9  xe-5-0-0.a21.tokyjp01.jp.ra.gin.ntt.net  267.699 ms  303.905 ms  271.633 ms
10  xe-9-1.a15.tokyjp01.jp.ra.gin.ntt.net  256.395 ms  xe-2-1.a15.tokyjp01.jp.ra.gin.ntt.net  267.366 ms  270.076 ms
11  xe-3-0-0.a21.osakjp01.jp.ra.gin.ntt.net  261.096 ms  257.71 ms  259.382 ms
12  ae-1.r20.osakjp01.jp.bb.gin.ntt.net  267.685 ms  263.266 ms  261.358 ms
13  as-1.r20.lsanca03.us.bb.gin.ntt.net  271.908 ms  266.593 ms  285.734 ms
14  po-1.r00.lsanca03.us.bb.gin.ntt.net  269.095 ms  274.537 ms  289.78 ms
15  ge-0.usc-los.lsanca03.us.bb.gin.ntt.net  272.253 ms  255.659 ms  272.387 ms
16  2001:1878:8::2  287.127 ms  273.509 ms  253.279 ms
17  www.rfc-editor.org  274.352 ms  256.764 ms  275.732 ms

It's a bit of a strange path - the outbound path appears to start in Australia, head on a route to Japan, and then cross a second transit network to the west coast of the US, where the server appears to be located. The essential observation is that there is consistent network connectivity here, so it's not a broken network path that is causing the problem. The packet trace reveals that all the packets from my system to the server appear to be getting through, but only some of the packets from the server to my system are making it.
In particular, all the small packets are making it through, but the large packets appear to be consistently dropped by the network. We can test this hypothesis with a simple ping test.

$ ping6 www.rfc-editor.org
PING6(56=40+8+8 bytes) 2001:dc0:2001:10:217:f2ff:fec9:1b10 --> 2001:1878:400:1:214:4fff:fe67:9351
16 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=0 hlim=248 time=272.253 ms
16 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=1 hlim=248 time=258.123 ms
16 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=2 hlim=248 time=283.547 ms
16 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=3 hlim=248 time=252.508 ms
16 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=4 hlim=248 time=272.222 ms
16 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=5 hlim=248 time=278.751 ms
16 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=6 hlim=248 time=256.72 ms
^C
--- www.rfc-editor.org ping6 statistics ---
7 packets transmitted, 7 packets received, 0% packet loss
round-trip min/avg/max = 252.508/267.732/283.547 ms

So far so good. But that used small packets. Let's try much larger packets:

$ ping6 -s 1452 www.rfc-editor.org
PING6(1500=40+8+1452 bytes) 2001:dc0:2001:10:217:f2ff:fec9:1b10 --> 2001:1878:400:1:214:4fff:fe67:9351
^C
--- www.rfc-editor.org ping6 statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

So packets which are sized to the local maximum packet size of 1500 bytes are not making it there and back. How about packets that are far smaller than the local maximum size?
$ ping6 -s 1400 www.rfc-editor.org
PING6(1448=40+8+1400 bytes) 2001:dc0:2001:10:217:f2ff:fec9:1b10 --> 2001:1878:400:1:214:4fff:fe67:9351
1408 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=0 hlim=248 time=321.879 ms
1408 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=1 hlim=248 time=322.99 ms
1408 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=2 hlim=248 time=333.956 ms
^C
--- www.rfc-editor.org ping6 statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 321.879/326.275/333.956 ms

Now we are on to something. Let's see what the maximum packet size that can make it there and back is by probing a few packet sizes in between.

$ ping6 -s 1431 www.rfc-editor.org
PING6(1479=40+8+1431 bytes) 2001:dc0:2001:10:217:f2ff:fec9:1b10 --> 2001:1878:400:1:214:4fff:fe67:9351
1439 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=0 hlim=248 time=322.465 ms
1439 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=1 hlim=248 time=339.634 ms
^C
--- www.rfc-editor.org ping6 statistics ---
3 packets transmitted, 2 packets received, 33% packet loss
round-trip min/avg/max = 322.465/331.049/339.634 ms

$ ping6 -s 1432 www.rfc-editor.org
PING6(1480=40+8+1432 bytes) 2001:dc0:2001:10:217:f2ff:fec9:1b10 --> 2001:1878:400:1:214:4fff:fe67:9351
1440 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=0 hlim=248 time=357.386 ms
1440 bytes from 2001:1878:400:1:214:4fff:fe67:9351, icmp_seq=1 hlim=248 time=324.929 ms
^C
--- www.rfc-editor.org ping6 statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 324.929/341.158/357.386 ms

$ ping6 -s 1433 www.rfc-editor.org
PING6(1481=40+8+1433 bytes) 2001:dc0:2001:10:217:f2ff:fec9:1b10 --> 2001:1878:400:1:214:4fff:fe67:9351
^C
--- www.rfc-editor.org ping6 statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

It appears that packets of size 1480 bytes are just fine, but packets of size 1481 bytes are getting
dropped. Now an IPv4 packet header is 20 bytes in length, and when a conventional IPv6-in-IPv4 tunnel is set up, the packet is inflated by the size of a single IPv4 packet header, or 20 bytes. The other observation is that a very prevalent maximum packet size is 1500 octets, the maximum size of a standard Ethernet frame. Now we are getting closer to the cause of the problem. The IPv6 packet from the server to my system is being sent out from the server as a full-size 1500 octet packet, with 1428 bytes of TCP payload, 32 bytes of TCP header and 40 bytes of IPv6 header. It is highly probable that somewhere on the path the IPv6 packet is placed into an IPv4 tunnel, and the packet size is inflated to 1520 octets with the addition of the IPv4 packet header. Either the tunnel ingress has a 1500 octet maximum packet size interface and the tunnel is evidently not permitted to perform IPv4 fragmentation, or the tunnel ingress has a larger MTU but some interior point in the tunnel has a 1500 octet MTU and the tunnel packet has the IPv4 don't fragment bit set. Either way, the oversized IPv6 packet is dropped. So when we re-examine the initial IPv6 TCP handshake, the client offered the server an MSS of 1440 bytes, an equivalent assertion that a 1500 octet packet (accounting for a 40 byte IPv6 packet header and a 20 byte TCP header) would be received by the client. Similarly, the server offered the client the same 1440 byte MSS, the same equivalent assertion about a 1500 octet MTU. In TCP the lower of the two MSS values is used to determine the maximum-sized packet used by both sides of the connection, and this initial value is used as the starting Path MTU by both sides of the connection. So the problem is that the remote server believes that the MTU of the path to the client is 1500 bytes, while the true path MTU is 1480 bytes. So why has the server managed to get confused about the path MTU? Why is this happening?
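The arithmetic behind this diagnosis can be checked with a short calculation. This is an illustrative sketch (the constant names are mine, not from any library), assuming a 1500-octet Ethernet MTU, a 20-octet IPv4 tunnel header, a 40-octet IPv6 header and a 20-octet base TCP header:

```python
# Illustrative MTU/MSS arithmetic for the scenario described above.
IPV4_HEADER = 20    # octets added by an IPv6-in-IPv4 tunnel
IPV6_HEADER = 40
TCP_HEADER = 20     # base TCP header, ignoring options
ETHERNET_MTU = 1500

# The MSS each side advertises, derived from a 1500-octet local MTU:
advertised_mss = ETHERNET_MTU - IPV6_HEADER - TCP_HEADER    # 1440, as in the trace

# The true path MTU once the IPv6 packet rides inside an IPv4 tunnel
# whose outer packets cannot exceed 1500 octets:
true_path_mtu = ETHERNET_MTU - IPV4_HEADER                  # 1480, as found by ping6

# The largest base TCP payload that can actually traverse this path:
usable_mss = true_path_mtu - IPV6_HEADER - TCP_HEADER       # 1420

print(advertised_mss, true_path_mtu, usable_mss)
```

The 1420-octet result is why any document larger than 1420 bytes can trigger the failure: the server's full-sized segments exceed the real path MTU, while smaller responses pass untouched.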
There are four factors that contribute to this problem: firstly, the desire to maximize packet size; secondly, the handling of packet fragmentation in IPv4; thirdly, the handling of tunnels; and lastly, the handling of packet fragmentation in IPv6.

Maximizing Packet Size

If fragmentation is causing such a problem then shouldn't we try to avoid the problem completely and just use the minimum packet size for all IP packets? Both IPv4 and IPv6 define a "minimum packet size". All IP hosts and routers must be able to pass a packet of this minimum size without resorting to IP fragmentation. The implication is that all IP interfaces must be able to support an MTU of this "minimum size". In IPv4 this is 68 bytes on a hop-by-hop basis, and every IPv4 destination must be able to reassemble a fragmented IPv4 datagram of up to 576 bytes in length.

RFC791:
   Every internet module must be able to forward a datagram of 68 octets without further fragmentation. This is because an internet header may be up to 60 octets, and the minimum fragment is 8 octets.
   Every internet destination must be able to receive a datagram of 576 octets either in one piece or in fragments to be reassembled.

In IPv6 the minimum MTU is 1280 bytes, and every IPv6 destination must be able to reassemble a fragmented IPv6 datagram of up to 1500 bytes in length.

RFC2460:
   IPv6 requires that every link in the internet have an MTU of 1280 octets or greater. On any link that cannot convey a 1280-octet packet in one piece, link-specific fragmentation and reassembly must be provided at a layer below IPv6.
   ...
   A node must be able to accept a fragmented packet that, after reassembly, is as large as 1500 octets. A node is permitted to accept fragmented packets that reassemble to more than 1500 octets.
   An upper-layer protocol or application that depends on IPv6 fragmentation to send packets larger than the MTU of a path should not send packets larger than 1500 octets unless it has assurance that the destination is capable of reassembling packets of that larger size.

So why not keep packet sizes small? After all, fragmentation is problematical for firewalls and filters, because fragments do not contain the TCP or UDP port addresses that are a conventional part of so many filtering rules, and fragmentation is a problem for the destination host insofar as each fragment that carries a new IP identifier causes the destination to open up a new fragmentation reassembly context. One possible answer is carriage efficiency, where the desire is to maximize the ratio of data payload to protocol header overhead. The larger the packet size, the greater the carriage efficiency. But this is a somewhat esoteric argument in a world of high capacity fibre systems. The most compelling argument is to maximise data performance. Routers and switches have a limit to their switching capacity, and the limit is one of the number of packets per second that the router can switch. If you really want to see your shiny new room-sized massive router get heated, then fragment all your IPv4 packets down to 68 bytes and send it 100Gbps of IP traffic to switch. That's around 184 million packets per second! From a switch's perspective the most efficient performance point is when the packet size is maximized. From an end-to-end performance perspective packet size is also a factor. TCP inflates its sending window by a Maximum Segment Size (MSS) unit for each received ACK in Slow Start and by one MSS per Round Trip Time in congestion avoidance mode. Smaller packet sizes make TCP less "responsive" to speed increases, which in practical terms has some impact on end-to-end performance under situations of sustained high volume transfers over a network path that is experiencing some level of traffic contention.
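The switching-load argument above comes down to simple arithmetic: divide the link's bit rate by the packet size in bits. A sketch (the function name is mine, purely for illustration):

```python
# Packets per second needed to fill a link with fixed-size packets.
def packets_per_second(link_bps: float, packet_bytes: int) -> float:
    return link_bps / (packet_bytes * 8)

LINK = 100e9  # a 100Gbps link

# Minimum-sized 68-byte IPv4 packets: roughly 184 million packets per second.
small = packets_per_second(LINK, 68)

# Full 1500-byte Ethernet frames: roughly 8 million packets per second,
# around a factor of 22 less switching work for the same bit rate.
large = packets_per_second(LINK, 1500)

print(round(small / 1e6), round(large / 1e6))
```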
This is perhaps countered to some small extent by the observation that TCP is a feedback-controlled system, and smaller packets increase the packet "density", which, in turn, increases the level of signalling information passed back to the sender. However, the basic TCP performance algorithm (see the Appendix of the article on "Evolving TCP" for details) says that the bandwidth of a TCP connection is directly proportional to the MSS and inversely proportional to the Round Trip Time multiplied by a factor that can be approximated by the square root of the packet loss rate. So if you can increase the MSS without altering the packet loss rate then you will probably see better TCP performance.

Packet Fragmentation in IPv4

According to the IPv4 technical specifications, all IPv4 hosts and routers must pass packets up to 576 bytes in length, while packets of a larger size may be fragmented, by either the host or a router, using the fragmentation control fields of the IPv4 packet header.

IPv4 Fragmentation Control

The IPv4 packet header field Total Packet Length is a 16-bit field that specifies the total length of the packet, including both the header and the payload. The value is in units of octets, and the field size of 16 bits allows the maximum IPv4 packet size to be 65,535 octets.
All hosts and routers must pass packets up to 576 octets in length without needing to invoke fragmentation. Packets of a larger size may be fragmented by the host or by a router. The Packet Identification, Fragmentation Flags, and Fragment Offset fields comprise a 32-bit header segment used to control packet fragmentation. Within the design of the IP protocol, every network has a maximum packet size, or maximum transmission unit (MTU), and these sizes vary. The MTU is an outcome of the network's design and is a product of the network's bandwidth, maximal diameter, and desired imposed jitter. Because the sender is not aware of the sequence of MTUs supported on the end-to-end sequence of networks used to reach any particular destination, the IPv4 specification allows every router to fragment an oversized IP packet into smaller units that match the MTU of the next hop. Fragmentation may occur more than once within an end-to-end transit path, and an already fragmented packet may be further fragmented without change to the fragmentation functionality. To accommodate this, and to allow some degree of transit efficiency, no fragmentation reassembly is attempted within the network. Once fragmented, a packet is reassembled only at the point of the final destination, as per the IP packet's destination address.

Three IPv4 packet header fields are used in fragmentation control: Packet Identifier, Fragmentation Flags, and Fragmentation Offset Value.

Packet Identifier: A 16-bit value used to identify all the fragments of a packet, allowing the destination host to perform packet reassembly. Note that a packet identifier value cannot be reused while fragments of a previous incarnation of that identifier value remain within the network. For low- to medium-speed networks this constraint is not a problem, but at gigabit speeds the identifier wraparound may prove to be a significant limitation. When using maximum-sized packets, fragmentation identifier wraparound occurs every 32 gigabits.
When using the more common 576-byte packet size, the fragmentation identifier wraparound is every 256 megabits. As traffic flow speeds increase, minimum packet sizes will need to increase, while at the same time fragmentation capability may have to be dropped as an available option for IP routers, because the combination of the Packet Identifier field and the packet size acts as a limit on the amount of data held in transit within the network.

Fragmentation Flags: The three-bit Flags field has its first bit reserved. The second bit is the Don't Fragment flag. When a router would otherwise fragment an IP packet with this flag set, no fragmentation occurs: the packet is discarded, and an ICMP error message is sent back to the sender to inform it of the delivery error. The third bit is the More Fragments flag. When a packet is fragmented, all fragments except the final fragment have the More Fragments flag set. The fragmentation algorithm operates such that only the final fragment of the original IP packet has this field clear (set to zero). When a fragment is further fragmented, this definition of the final fragment refers to the original packet. Thus, fragmentation of an already fragmented packet where the More Fragments flag is already set results in a sequence of smaller packets where the final packet still has the More Fragments flag set. Fragmentation of an already fragmented packet that was the final fragment, i.e. where the More Fragments flag is clear and the Fragment Offset is non-zero, will result in a sequence of smaller packets where the More Fragments flag is set for all but the final smaller fragment.

Fragmentation Offset Value: This 13-bit value counts the offset of the start of this fragment from the start of the original packet. The unit used by this counter is octa-bytes (units of 8 octets), implying that fragmentation must align to 64-bit boundaries.
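The interaction of these three fields can be sketched as a simple fragmentation routine. This is illustrative only: it assumes a minimal 20-octet IPv4 header and ignores the Identifier and Don't Fragment handling.

```python
IPV4_HEADER = 20  # minimal IPv4 header, no options

def fragment(total_length: int, mtu: int):
    """Split a packet into (length, fragment_offset, more_fragments) tuples.

    The fragment offset is expressed in octa-bytes (units of 8 octets),
    so each fragment's payload, except the last, is rounded down to a
    multiple of 8 octets.
    """
    payload = total_length - IPV4_HEADER
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    fragments, offset = [], 0
    while payload > 0:
        take = min(per_fragment, payload)
        fragments.append((IPV4_HEADER + take, offset // 8, payload > take))
        offset += take
        payload -= take
    return fragments

# A 1500-octet packet fragmented for a 660-octet MTU becomes two 660-octet
# fragments and one 220-octet fragment, at offsets 0, 80 and 160.
print(fragment(1500, 660))
```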
The fields altered by fragmentation are shown in the figure above, where a 1500-byte IP packet has been fragmented into two 660-byte packets and one 220-byte packet. The IP packet lengths have been altered to reflect the fragment sizes, and the Fragmentation Offset Value fields have been set to 0, 80, and 160 respectively. The final fragment has the More Fragments flag cleared to show that it is the final fragment of the original packet. IP packet fragmentation is not completely transparent in terms of performance, nor is it very secure. If one fragment is lost within the network, the entire packet is discarded by the receiver, as retransmitting just the lost fragment is not a viable option for the protocol: the sender has no knowledge that the packet was fragmented, nor of the manner of the fragmentation split. Fragmentation also has resource implications for the destination, where reception of a single fragment implies that the destination has to buffer the fragment and start a reassembly timer to await the reception of the other fragments before completing reassembly. This can be exploited as a resource exhaustion attack by flooding the destination with bogus packets with the Fragment Offset Value set to some large non-zero value. Rather than rely on IP packet fragmentation to adapt to various per-hop packet contortions, TCP can use a path MTU discovery algorithm, as described in RFC1191, to allow the TCP session to proceed without the use of fragmentation. Applications that use Path MTU discovery send all packets with the Don't Fragment flag set, preventing any router from performing fragmentation on the packet flow. When a router cannot forward the packet because of an MTU mismatch, the router discards the packet and generates a return ICMP Destination Unreachable message to the source address with an ICMP code of "fragmentation needed and DF set".
The router should place the value of the next hop MTU into the ICMP message, if it supports the RFC1191 mechanism. Upon receipt of this ICMP message the TCP source can set its MTU down to the next hop MTU, if that information is available in the ICMP message, or perform some other form of MTU reduction to adapt to the path MTU. If all else fails, the sender can drop down to the minimum size IPv4 packet that does not require fragmentation, namely 576 bytes. This algorithm explicitly reverses the information flow of the fragmentation situation. IPv4 without TCP Path MTU discovery essentially signals the information about the fragmentation condition forwards, towards the destination. Path MTU discovery reverses this, and makes the intermediate router send the fragmentation signal backwards, towards the message source. This would normally be a pretty trivial semantic change were it not for two additional factors: ICMP filters and tunnels. The problem with this approach is that many filters and firewalls treat ICMP packets as potentially hostile, and they often block the ICMP packet's traversal through the firewall. The second problem is that the use of NATs has the potential to further confound the matter, because the incoming ICMP packet carries the source address of the intermediate system, and therefore may not be recognised by the NAT as being part of the packet flow belonging to an already established binding. It may also be the case that the NAT does not perform the appropriate address transform on the inner IP packet carried as the payload of the ICMP packet. The result is that because the addressed host may not be aware that it is behind a NAT, let alone be aware of the NAT's IP translation binding, the host may not even recognise the ICMP packet as being directed to itself, even if it is passed through the firewalls and NATs.

Tunnels

Tunnels add to the complexity here.
The first problem with a tunnel is that it adds an additional overhead: 20 bytes for a basic IPv4 header, 24 bytes for a GRE tunnel header, or 40 bytes for a UDP-based IPv4 tunnel header. There are other forms of tunnels as well, such as the 8 byte PPPoE header, or the overheads of the AH and ESP headers of IPSEC tunnels. This additional header overhead implies that the tunnel's MTU is smaller than the "raw" interface MTU. The second problem with a tunnel is that there may be further tunnels "inside" the tunnel, so that the tunnel ingress is not necessarily aware of the true tunnel MTU. The third problem is that the routing of the interior of the tunnel may change, so the tunnel MTU may be variable. However, the default behaviour of IPv4 tunnels should be benign, or so you would think. The outer IPv4 "wrapper" will have the DF bit of the tunnel header cleared, so that an MTU mismatch within the tunnel will cause the tunnel packet to be fragmented. The fragmentation is not visible on an end-to-end basis, as the tunnel egress has the responsibility of reassembling the original IP packet fragments. And if the IPv4 tunnel wrapper were added to the packet before any size check against the tunnel interface, that assumption would be fine. But that's not the way it's done. The tunnel is regarded the same as any other interface, and if the original packet can't fit into the ingress point of the tunnel then it will be fragmented before having the tunnel encapsulation added. So if the original packet has the DF bit set and is too large to fit into the tunnel interface without fragmentation, the original packet is discarded and an ICMP message is generated to flag the MTU mismatch. Oddly enough, once the packet has been passed into the tunnel, the default case is that further fragmentation can be performed on the tunnel packet, as the common default option is to use an IPv4 tunnel header with the DF bit cleared.
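The overheads listed above translate directly into reduced effective MTUs for the inner packet. A quick sketch (the overhead table and function name are mine, for illustration):

```python
# Encapsulation overheads mentioned above, in octets.
OVERHEADS = {
    "IPv4-in-IPv4": 20,  # basic IPv4 header
    "GRE": 24,           # IPv4 header plus GRE header
    "PPPoE": 8,
}

def tunnel_mtu(interface_mtu: int, encapsulation: str) -> int:
    """Largest inner packet that fits without fragmenting the outer packet."""
    return interface_mtu - OVERHEADS[encapsulation]

# With a 1500-octet underlying interface, the effective inner MTUs are
# 1480, 1476 and 1492 octets respectively.
for name in OVERHEADS:
    print(name, tunnel_mtu(1500, name))
```

Note that the 1480-octet figure for a plain IPv6-in-IPv4 tunnel matches the path MTU found by the ping6 probing earlier.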
For such fragmentation conditions, packet reassembly is performed at the tunnel egress, rather than at the inner packet's ultimate destination. For low speed tunnels this is probably benign behaviour. For higher speed situations the reassembly load at the tunnel egress may be unacceptably high. To avoid this, the tunnel can be configured to map the encapsulated packet's DF bit to the outer wrapper IP packet header. As long as the tunnel ingress point is prepared to perform ICMP relay functions - remapping the reverse tunnel ICMP message into a message that has the tunnel headers stripped out and the original source and destination addresses placed into the ICMP message header - then there is the possibility of allowing end-to-end Path MTU discovery to take account of the additional tunnel overhead. It should not be surprising that all of this just gets too complex to maintain operationally, and the pragmatic result of all of these considerations is that most host systems use an MTU of 1500 bytes, while most interior routers use an MTU of around 9000 bytes or larger on point-to-point links, and generally avoid going below 1500 octets on any interior link. As long as tunnels are generally avoided, then there is no real path MTU discovery taking place on the Internet, the firewall and tunnel issues and the NAT treatment of ICMP messages are largely irrelevant in such a uniform MTU environment, and most of the Internet appears to work acceptably well for most of the Internet's users. Of course VPN users can experience some very strange outcomes, but as long as the VPN tunnel endpoints are carefully configured, most VPNs can be made to work relatively robustly. But IPv6 can be different, particularly in its use of auto-tunnelling and its different treatment of packet fragmentation.

Packet Fragmentation in IPv6

IPv6 takes a much more stringent approach to packet fragmentation than IPv4.
IPv6 assumes that all TCP sessions in IPv6 have Path MTU discovery capability, and also assumes that all UDP applications can perform some equivalent form of path MTU discovery. The result of this design assumption is that all fragmentation control fields are removed from the base IPv6 packet header. IPv6 routers, and any other intermediate systems, must not attempt to perform packet fragmentation on an IPv6 packet. If an IPv6 packet is too large for the next hop interface, then the router must discard the IPv6 packet, generate an ICMPv6 "Packet Too Big" message and send this back to the IPv6 source. In the case of TCP, the IPv6 host system should perform Path MTU discovery based on these ICMPv6 messages, and avoid performing packet fragmentation at the source. In other cases, such as UDP or "raw" IP, the upper-level protocol driver may not be able to reformat the original payload data into multiple IPv6 packets, and may prefer to have the remote upper-level protocol instance receive a single packet payload. IPv6 allows the source of the packet to perform payload fragmentation, generating a number of IPv6 packets, each with a fragmentation control header that plays a similar role to the IPv4 fragmentation control fields.

IPv6 Fragmentation Control

The IPv6 packet header has a 16-bit unsigned Payload Length field, indicating the length of the packet, less the 40 byte IPv6 packet header. This allows for a maximum "normal" IPv6 packet size of 65,575 octets. IPv6 may include a jumbogram header that permits larger packets of up to 4G bytes in size, although router support for such large packets is optional. All IPv6 hosts and routers must pass packets up to 1280 octets in length. Fragmentation of an IPv6 packet may only be performed at the source point of the packet. No further fragmentation, nor any form of fragment reassembly, is attempted by any intermediate device.
Once fragmented at the source, an IPv6 packet is reassembled only at the point of the final destination, as per the IPv6 packet's destination address. As with IPv4, IPv6 uses three control fields: the Packet Identification, More Fragments Flag, and Fragment Offset fields. These fields are formatted into an 8-byte IPv6 Fragmentation Header, referenced using the IPv6 Next Header code of 44. When a host fragments an IPv6 packet it adds a Fragmentation Header to each fragment. The header has three control fields: Identification, More Fragments, and Fragmentation Offset:

Identification: A 32-bit value used to identify all the fragments of a packet, allowing the destination host to perform packet reassembly.

More Fragments Flag: When a packet is fragmented, all fragments except the final fragment have the More Fragments flag set. The fragmentation algorithm operates such that only the final fragment of the original IP packet has this field clear (set to zero).

Fragmentation Offset Value: This 13-bit value counts the offset of the start of this payload fragment from the start of the original packet. The unit used by this counter is octa-bytes, implying that fragmentation must align to 64-bit boundaries.

The fields altered by fragmentation are shown in the figure above, where a 1500-byte IP packet has been fragmented into a 1280-byte packet and a 276-byte packet. The packet lengths have been altered to reflect the fragment sizes, and the Fragmentation Offset Value fields have been set to 0 and 154 respectively. The final fragment has the More Fragments Flag cleared to show that it is the final fragment of the original packet.
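Source fragmentation under these rules can be sketched as follows. This is an illustration only: it assumes every fragment carries the 40-octet IPv6 header plus the 8-octet Fragmentation Header, with no other extension headers in the unfragmentable part.

```python
IPV6_HEADER = 40
FRAG_HEADER = 8

def fragment6(total_length: int, path_mtu: int):
    """Split an IPv6 packet into (length, fragment_offset, more_fragments) tuples.

    The offset is in octa-bytes, so every fragment's payload except the
    last is rounded down to a multiple of 8 octets.
    """
    payload = total_length - IPV6_HEADER
    per_fragment = (path_mtu - IPV6_HEADER - FRAG_HEADER) // 8 * 8
    fragments, offset = [], 0
    while payload > 0:
        take = min(per_fragment, payload)
        fragments.append((IPV6_HEADER + FRAG_HEADER + take, offset // 8, payload > take))
        offset += take
        payload -= take
    return fragments

# A 1500-octet packet fragmented to the IPv6 minimum MTU of 1280 octets:
# a 1280-octet fragment at offset 0, and a 276-octet final fragment at
# offset 154 (154 octa-bytes = 1232 octets of payload in the first fragment).
print(fragment6(1500, 1280))
```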
IPv6 defines an explicit ordering of IPv6 packet headers:

- IPv6 packet header
- Hop-by-Hop Options header
- Destination Options header (intermediate destinations)
- Routing header
- Fragment header
- Authentication header
- Encapsulating Security Payload header
- Destination Options header (final destination)
- upper-layer header

The first four header types form the "unfragmentable part" of the packet, and are reproduced in the headers of every fragment packet, while the final four header types are treated by the IPv6 fragmentation algorithm as part of the payload, and are not reproduced in every fragment's header.

The basic behaviour for IPv6 TCP is to avoid fragmentation. IPv6 TCP uses the local MTU to derive the initial local MSS value (adjusting the MTU to allow for the IP and TCP packet headers), and then uses the minimum of this MSS value and the remote party's MSS value as the session MSS value, deriving an initial MTU value by adding back the allowance for the IP and TCP packet headers. This initial MTU is used as the initial path MTU estimate. Once the TCP connection is established the sender relies on incoming Packet Too Big ICMPv6 messages to trigger the local TCP instance to use a smaller MTU, using the MTU indicated in the ICMPv6 packet as the new MTU. So as long as the sender is receiving these ICMPv6 messages, TCP should adjust correctly when the initial Path MTU estimate is too high. But if there is a condition that prevents the source from receiving Packet Too Big ICMPv6 messages then the algorithm fails, and the application may hang when full-sized TCP packets are passed through the network. In some cases this may happen at a point well distanced from the two endpoints of the TCP session, so that the ICMPv6 filtering may be occurring at a point that is not under the control of the source or the destination. This is the basic reason why so many web server systems are averse to configuring themselves as dual stack IPv4 and IPv6 servers.
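The MSS and path MTU estimation just described can be sketched as follows, assuming the conventional 40 octets of IPv6 header plus 20 octets of TCP header:

```python
# Sketch of the path MTU estimation described above: the handshake
# yields an initial estimate from the two MSS values, and subsequent
# ICMPv6 Packet Too Big messages can only reduce it, never below the
# IPv6 minimum MTU of 1280 octets.

IP_TCP_HDRS = 40 + 20

def initial_path_mtu(local_mtu: int, remote_mss: int) -> int:
    """Initial path MTU estimate after the TCP handshake: the minimum
    of the local MSS (local MTU less headers) and the remote party's
    MSS, converted back to an MTU."""
    local_mss = local_mtu - IP_TCP_HDRS
    return min(local_mss, remote_mss) + IP_TCP_HDRS

def on_packet_too_big(current_mtu: int, icmp_mtu: int) -> int:
    """Reduce the path MTU estimate on receipt of an ICMPv6 Packet
    Too Big message."""
    return max(min(current_mtu, icmp_mtu), 1280)

mtu = initial_path_mtu(1500, 1440)   # both ends on 1500-octet Ethernet
mtu = on_packet_too_big(mtu, 1480)   # a tunnel on the path reports 1480
print(mtu)                           # → 1480
```

If the Packet Too Big message never arrives, `on_packet_too_big` is never called, the estimate stays at 1500, and full-sized packets are silently lost: exactly the hang described here.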
The problem is that through no fault in the local configuration of the IPv6 server, and through no fault in the configuration of the IPv6 client, there are situations where the application fails, even though every part of the system appears to be functioning. What's the answer? I suppose that there are a number of approaches, depending on what part of the network you want to alter. The following is by no means an exhaustive list, but looks at the most reasonable changes that could fix this problem.

Change the Network's Behaviour

The root cause of the problem here is the missing ICMPv6 packet, which has been swallowed up by either a filter or a poorly constructed NAT. Filters should permit Packet Too Big ICMP messages, and probably should allow the following set of ICMP messages in order to do no harm to IPv6 Path MTU discovery. Here's an example set of ICMP filters:

deny icmp any any fragments
permit icmp any any echo
permit icmp any any echo-reply
permit icmp any any packet-too-big
permit icmp any any time-exceeded
permit icmp any any port-unreachable
permit icmp any any net-unreachable
deny icmp any any

But it's not easy to ensure that every possible filter on every possible path is updated to work correctly, and perhaps there are responses that the client or the server might also be able to undertake that would mitigate this "IPv6 hang" behaviour directly.

Change the Client Application Behaviour

The client's dual stack browser behaviour could be altered. The problem here is that the browser operates in serial mode and caches its results. In this case the browser has performed a DNS AAAA query for the server's name and received an IPv6 address in response, and has then successfully completed a TCP handshake. As far as the browser is concerned, at the completion of this initial TCP handshake it can talk to this server using IPv6, so it caches this result, and will now not attempt to fall back to use IPv4 within the lifetime of this session.
When the HTTP application fails to complete a GET operation using the IPv6 transport, the browser assumes a connectivity failure, and will not attempt to restart TCP using IPv4 and retry the GET operation over IPv4. It is possible to consider an altered browser behaviour where it would perform the initial DNS name resolution in both IPv4 and IPv6 in parallel, then perform the initial TCP handshake in parallel as well, and then use the protocol that returns first to perform the HTTP operation, leaving the other protocol channel open and falling back to it in the event of an HTTP-level timeout of a GET. Of course this imposes greater levels of load on the dual stack server, with all these persistent connections being held open just in case, and of course it really does not solve the underlying problem here, which will exhibit itself in any TCP application that attempts to use full-size TCP packets, not just HTTP fetches. But if the basic catch cry of networking is "every application should fend for itself", then equipping a dual stack client's browser with greater levels of resilience in the face of such issues is a likely response, irrespective of the increase in server load that this may imply.

Change the Client Protocol Stack

The client could be configured to use an IPv6 MTU that is 40 octets smaller than the interface MTU, implicitly allowing for a UDP-based IPv4 tunnel to be present in the path. Using this approach the client would offer an MSS in the initial handshake that is 100 bytes less than the interface MTU: the 40-octet tunnel allowance plus 60 octets of IPv6 and TCP packet headers. The two TCP endpoints will pick an initial MSS that is the minimum of the two MSS values offered in the initial TCP handshake. By offering a lower MSS the client is able to alter the server's behaviour, in effect forcing the server to use a lower Path MTU, and thereby getting around the problem of a missing Packet Too Big ICMPv6 message.
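The parallel-connection idea for browsers could be sketched as follows. This is a hypothetical illustration, not how any current browser is implemented; the "connectors" here are stand-in callables, where a real client would call socket.create_connection() once per address family:

```python
# Hypothetical sketch of a parallel dual stack connection race: attempt
# IPv6 and IPv4 concurrently and use whichever protocol connects first.

import queue
import threading

def first_to_connect(connectors):
    """Run each connector callable in its own thread and return the
    label of the first one to succeed, or None if all of them fail."""
    results = queue.Queue()

    def attempt(label, connect):
        try:
            connect()
            results.put(label)
        except OSError:
            results.put(None)   # this attempt failed

    for label, connect in connectors.items():
        threading.Thread(target=attempt, args=(label, connect),
                         daemon=True).start()
    for _ in connectors:
        winner = results.get()
        if winner is not None:
            return winner
    return None

def failing_connect():
    # Stand-in for a connection attempt that hangs or is unreachable.
    raise OSError("no route to host")

# Simulated outcome: the IPv6 attempt fails, the IPv4 attempt succeeds.
demo = {"ipv6": failing_connect, "ipv4": lambda: None}
print(first_to_connect(demo))   # → ipv4
```

Note that a real browser would still need an HTTP-level timeout to catch the case described here, where the IPv6 handshake succeeds but the large data packets are black-holed.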
So if you are a dual stack client and you are seeing some strange behaviour from servers, namely that they fail to deliver content, and you feel confident you know what you are doing, then the brute force:

# ifconfig en0 mtu 1400

would be enough to get over most of these issues. It would have the side effect of dropping the packet size for all packets, but the performance impact of such an alteration in the maximum packet size is relatively minor in all but the most demanding of environments.

Change the Server Protocol Behaviour

And what of the server? The server could change its behaviour to interpret a TCP transmission timeout as a potential MTU failure after 3 retransmit intervals, and try the fourth retransmission with a minimum-size IPv6 packet of 1280 bytes. But asking vendors to change their TCP behaviour to incorporate various patches and workarounds is always going to be a hard ask, and there is always a more direct solution for servers in any case. If you want to increase the reliability of the IPv6 part of your dual stack server, then pick a conservative MTU for your server. What works for the client will work for the server, and:

# ifconfig <interface> mtu 1400

will ensure that the IPv6 service will probably not fail due to undetected Path MTU problems. What I've done for http://www.potaroo.net, which is a dual-stack server, is to drop the MTU as follows:

# ifconfig bge1 mtu 1400
$ ifconfig bge1
bge1: flags=8843 mtu 1400
        options=1b
        inet6 fe80::215:c5ff:fefc:5f07%bge1 prefixlen 64 scopeid 0x2
        inet 203.119.0.116 netmask 0xfffffff0 broadcast 203.119.0.127
        inet6 2001:dc0:2001:7:215:c5ff:fefc:5f07 prefixlen 64
        ether 00:15:c5:fc:5f:07
        media: Ethernet autoselect (100baseTX )
        status: active

This seems to me to represent the most practical approach to making dual stack clients and servers more reliable in the face of these vagaries of network behaviour.
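The server-side retransmission workaround mentioned above could be sketched as follows. This is a hypothetical heuristic, not the behaviour of any shipping TCP implementation:

```python
# Hypothetical server-side workaround: treat repeated retransmission
# timeouts as a possibly black-holed path MTU, and clamp the segment
# size to the IPv6 minimum MTU on the fourth attempt.

IPV6_MIN_MTU = 1280
IPV6_TCP_HDRS = 60   # 40-octet IPv6 header plus 20-octet TCP header

def next_segment_size(current_mss: int, retransmit_count: int) -> int:
    """MSS to use for the next retransmission: after three failed
    attempts, fall back to the minimum-MTU segment size."""
    if retransmit_count >= 3:
        return IPV6_MIN_MTU - IPV6_TCP_HDRS
    return current_mss

print(next_segment_size(1440, 0))   # → 1440: early tries keep the full MSS
print(next_segment_size(1440, 3))   # → 1220: fourth attempt fits in 1280
```

A 1220-octet segment in a 1280-octet packet must be deliverable on any conformant IPv6 path, so if this retransmission succeeds the sender has strong evidence of an undetected path MTU problem.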
Waiting for every filter in the Internet to do the right thing with ICMP messages may well be a fruitless task, and adding further complexity into applications or the TCP protocol behaviour seems to take the long way around the problem. The most effective approach appears to be the simplest one as well: whether you are a dual stack client or a dual stack server, the best way to get more reliable service under these rather strange corner cases is to drop your MTU.

How to create a Reliable Dual Stack Environment

In listening to a number of presentations, most notably in recent times presentations relating to popular web sites that have avoided configuring a dual stack service, I am struck that there is a widespread perception that switching a server to dual stack creates all kinds of strange problems for end users. I hope the above analysis offers some reasonable explanation as to what is happening in the network to cause these particular problems, and provides some possible approaches as to how servers can avoid the issue. It seems that if there is one practical lesson from all this it's simple: if you are worried about the possible adverse consequences of converting your servers to support a dual stack configuration, and you're waiting for the ICMP filtering problem to be "fixed" across the Internet, then you'll be waiting for a very long time. You can take a very easy step as the system administrator of a dual stack server by simply dropping the MTU of the interfaces used for IPv6 traffic by a further 40 bytes or so, to allow for the possibility of "hidden" tunnels in the IPv6 path to your clients.

At the start of this article I referenced ipv6.google.com. What MSS does this server offer on a TCP connection?
** client to server - MSS of 1440
05:12:35.661598 IP6 (hlim 64, next-header: TCP (6), length: 44) 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562 > 2001:4860:b002::68.http: S 77702533:77702533(0) win 65535

** server to client - MSS of 1212
05:12:35.989745 IP6 (hlim 243, next-header: TCP (6), length: 24) 2001:4860:b002::68.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562: S, cksum 0xac69 (correct), 704559415:704559415(0) ack 77702534 win 8190

** completion of 3 way TCP handshake
05:12:35.989841 IP6 (hlim 64, next-header: TCP (6), length: 20) 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562 > 2001:4860:b002::68.http: ., cksum 0xe32c (correct), 1:1(0) ack 1 win 65535

** HTTP GET request
05:12:35.990384 IP6 (hlim 64, next-header: TCP (6), length: 845) 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562 > 2001:4860:b002::68.http: P 1:826(825) ack 1 win 65535

** ACK of request
05:12:36.351407 IP6 (hlim 60, next-header: TCP (6), length: 20) 2001:4860:b002::68.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562: ., cksum 0xc4ab (correct), 1:1(0) ack 826 win 6984

** first data packet - 1212 bytes of payload, 1272 byte packet
05:12:36.376562 IP6 (hlim 60, next-header: TCP (6), length: 1232) 2001:4860:b002::68.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562: . 1:1213(1212) ack 826 win 6984

** second data packet - 1212 bytes of payload, 1272 byte packet
05:12:36.383811 IP6 (hlim 60, next-header: TCP (6), length: 1232) 2001:4860:b002::68.http > 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562: . 1213:2425(1212) ack 826 win 6984

** ACK of first two data packets
05:12:36.383884 IP6 (hlim 64, next-header: TCP (6), length: 20) 2001:dc0:2001:10:217:f2ff:fec9:1b10.51562 > 2001:4860:b002::68.http: ., cksum 0xd67b (correct), 826:826(0) ack 2425 win 65535

It seems from this trace that this instance of the site is offering an MSS of 1212, corresponding to an MTU of 1272 if one allows for the conventional 60 bytes of IPv6 and TCP headers per packet. This is 8 bytes less than the minimum supported MTU for IPv6 of 1280 bytes.
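The arithmetic here is easy to check. The sketch below assumes the conventional 40-octet IPv6 header and 20-octet TCP header:

```python
# Checking the MSS-to-MTU arithmetic: 60 octets of IPv6 and TCP headers
# separate the two values.

IPV6_TCP_HDRS = 40 + 20

def mss_to_mtu(mss: int) -> int:
    """The packet size implied by a given TCP MSS over IPv6."""
    return mss + IPV6_TCP_HDRS

def conservative_mss(interface_mtu: int, tunnel_allowance: int = 40) -> int:
    """The MSS a cautious dual stack host might offer: leave a further
    allowance for a hidden IPv4 tunnel somewhere on the path."""
    return interface_mtu - tunnel_allowance - IPV6_TCP_HDRS

print(mss_to_mtu(1212))        # → 1272: 8 octets below the 1280 minimum MTU
print(conservative_mss(1500))  # → 1400: the MSS for 1500-octet Ethernet
```

The same arithmetic explains the later trace: a 1480-octet tunnel path MTU corresponds to a 1420-octet MSS.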
An MTU of 1272 is perhaps the most conservative position with respect to tunnelling and fragmentation, and maximises the prospects of successfully connecting to the server over IPv6, although why 1272 rather than 1280 is used here remains an open question. So, for servers, the conservative message if you want robust dual stack operation is to set the server's MTU for IPv6 to at least 40 bytes less than the interface MTU size, or even to consider resetting it down to 1280 bytes to take the most conservative position. And the same goes for clients. If you are experiencing strange "hangs" in your local dual stack environment, you might try dropping the local MTU size of your host's network interfaces by some 40 bytes or so before you take the more drastic step of turning off all IPv6 support on your platform.

Postscript

I started this article with a problem I was having with retrieving documents over IPv6 from http://www.rfc-editor.org. The problem has magically been fixed for me, but I didn't do a thing. While I was preparing this material my path to this server changed with the addition of a new tunnel. Here's the new path that I now see to this server.
$ traceroute6 www.rfc-editor.org
traceroute6 to www.rfc-editor.org (2001:1878:400:1:214:4fff:fe67:9351) from 2001:dc0:2001:10:217:f2ff:fec9:1b10, 30 hops max, 12 byte packets
 1  2001:dc0:2001:10:2b0:c2ff:fe8d:c034  1.772 ms  1.233 ms  1.286 ms
 2  2001:dc0:2001:249::1  36.253 ms  35.054 ms  34.582 ms
 3  2001:dc0:8000::2:2  35.669 ms  37.18 ms  36.038 ms
 4  apnic-1.tunnel.tserv1.fmt.ipv6.he.net  231.558 ms  231.723 ms  233.099 ms
 5  1g-3-9.core1.fmt1.ipv6.he.net  239.552 ms  251.8 ms  240.694 ms
 6  10gigabitethernet1-1.core1.pao1.he.net  239.208 ms  234.838 ms  231.211 ms
 7  10gigabitethernet1-1.core1.lax1.he.net  240.625 ms  244.372 ms  246.882 ms
 8  10gigabitethernet1-3.core1.lax2.he.net  240.248 ms  241.717 ms  243.527 ms
 9  2001:504:13::9  244.329 ms  244.358 ms  247.989 ms
10  2001:1878:8::2  242.494 ms  240.967 ms  241.866 ms
11  www.rfc-editor.org  243.669 ms  240.794 ms  243.642 ms

It's pretty clear even from the DNS names of the intermediate routers here that there's at least one tunnel in the path, between hops 3 and 4, and the IPv6 Path MTU is no larger than 1480 bytes, so the TCP offered MSS should be 1420 bytes in the initial TCP handshake. However, neither client nor server is initially aware of the existence of tunnels, so the conditions are much the same as the earlier conditions that exhibited the problem. But this time there is no problem.
Safari is successfully pulling in the web pages from the server without a problem:

** Start TCP session with three-way handshake - note the offer of a 1440 MSS on both sides
21:32:52.200911 IP6 (flowlabel 0x24763, hlim 64, next-header: TCP (6), length: 40) 2001:dc0:2001:10:20e:7fff:feac:d687.49587 > www.rfc-editor.org.http: S, cksum 0x6e41 (correct), 1009278231:1009278231(0) win 65535
21:32:52.447775 IP6 (hlim 50, next-header: TCP (6), length: 44) www.rfc-editor.org.http > 2001:dc0:2001:10:20e:7fff:feac:d687.49587: S, cksum 0x3c26 (correct), 3310917243:3310917243(0) ack 1009278232 win 49980
21:32:52.447852 IP6 (flowlabel 0x24763, hlim 64, next-header: TCP (6), length: 32) 2001:dc0:2001:10:20e:7fff:feac:d687.49587 > www.rfc-editor.org.http: ., cksum 0xbed6 (correct), ack 1 win 32844

** Client to Server: HTTP GET request for document and Server to Client ACK
21:32:52.448050 IP6 (flowlabel 0x24763, hlim 64, next-header: TCP (6), length: 438) 2001:dc0:2001:10:20e:7fff:feac:d687.49587 > www.rfc-editor.org.http: P, cksum 0xe3e2 (correct), 1:407(406) ack 1 win 32844
21:32:52.714093 IP6 (hlim 50, next-header: TCP (6), length: 32) www.rfc-editor.org.http > 2001:dc0:2001:10:20e:7fff:feac:d687.49587: ., cksum 0x7bcb (correct), ack 407 win 49574

** Server to Client: HTTP Server OK response
21:32:52.717158 IP6 (hlim 50, next-header: TCP (6), length: 351) www.rfc-editor.org.http > 2001:dc0:2001:10:20e:7fff:feac:d687.49587: P, cksum 0x4276 (correct), 1:320(319) ack 407 win 49574

** Server to Client: First data packet - note the IPv6 packet size is now 1480 bytes, so the Server has performed Path MTU discovery and reduced the MTU by 20 bytes. Note also that this packet was received 21 milliseconds after the previous packet, indicating that the tunnel point on this path is within 10 milliseconds of the server
21:32:52.738526 IP6 (hlim 50, next-header: TCP (6), length: 1440) www.rfc-editor.org.http > 2001:dc0:2001:10:20e:7fff:feac:d687.49587: ., cksum 0xd515 (correct), 1:1409(1408) ack 407 win 49574

** Client to Server: ACK
21:32:52.738644 IP6 (flowlabel 0x24763, hlim 64, next-header: TCP (6), length: 32) 2001:dc0:2001:10:20e:7fff:feac:d687.49587 > www.rfc-editor.org.http: ., cksum 0xb8a3 (correct), ack 1409 win 32299

** Server to Client: Second and third data packets and Client to Server ACK
21:32:52.747296 IP6 (hlim 50, next-header: TCP (6), length: 1440) www.rfc-editor.org.http > 2001:dc0:2001:10:20e:7fff:feac:d687.49587: ., cksum 0xacda (correct), 1409:2817(1408) ack 407 win 49574
21:32:52.749191 IP6 (hlim 50, next-header: TCP (6), length: 391) www.rfc-editor.org.http > 2001:dc0:2001:10:20e:7fff:feac:d687.49587: ., cksum 0x4b4e (correct), 2817:3176(359) ack 407 win 49574
21:32:52.749294 IP6 (flowlabel 0x24763, hlim 64, next-header: TCP (6), length: 32) 2001:dc0:2001:10:20e:7fff:feac:d687.49587 > www.rfc-editor.org.http: ., cksum 0xb044 (correct), ack 3176 win 32664

** Client opens a new TCP session
21:32:52.771805 IP6 (flowlabel 0x087bc, hlim 64, next-header: TCP (6), length: 40) 2001:dc0:2001:10:20e:7fff:feac:d687.49588 > www.rfc-editor.org.http: S, cksum 0x5ae5 (correct), 3472874335:3472874335(0) win 65535

** Server to Client: TCP handshake response - note that the server has now cached the smaller Path MTU for this client, and the offered MSS is now 1420 in the initial TCP handshake
21:32:53.012890 IP6 (hlim 50, next-header: TCP (6), length: 44) www.rfc-editor.org.http > 2001:dc0:2001:10:20e:7fff:feac:d687.49588: S, cksum 0xd216 (correct), 64991551:64991551(0) ack 3472874336 win 49280
21:32:53.013007 IP6 (flowlabel 0x087bc, hlim 64, next-header: TCP (6), length: 32) 2001:dc0:2001:10:20e:7fff:feac:d687.49588 > www.rfc-editor.org.http: ., cksum 0x5109 (correct), ack 1 win 33088

[and so on]

The difference here is that on this new path an ICMPv6 Packet Too Big error is being generated in response to the server's first full-size data packet, but this time it is being successfully passed back to the server site, as it is evident that the server has adjusted its path MTU value after the first large packet. The tentative conclusion of this observation is that the problem on the previous path was not necessarily a filter problem within the server's site. It is more likely that the ICMPv6 filter existed further along the original path, and was probably not connected with either the server's immediate network environment or that of the client. In closing, I should stress that this is definitely not an issue with the RFC Editor's web site per se, nor with the network and the associated local filters that are used by the RFC Editor's local network environment. I am sure that they operate a professional service, and the problem examined here was not related to the RFC Editor's service infrastructure in any way. The fault I was experiencing here appears to have been over-zealous ICMP filtering further along the network path between this web site and myself.

Disclaimer

The above views do not necessarily represent the views of the Asia Pacific Network Information Centre.

About the Author

GEOFF HUSTON holds a B.Sc. and a M.Sc. from the Australian National University. He has been closely involved with the development of the Internet for many years, particularly within Australia, where he was responsible for the initial build of the Internet within the Australian academic and research sector. He is author of a number of Internet-related books, and is currently the Chief Scientist at APNIC, the Regional Internet Registry serving the Asia Pacific region.
He was a member of the Internet Architecture Board from 1999 until 2005, and served on the Board of the Internet Society from 1992 until 2001. www.potaroo.net