Network problems between OVH (Paris) and BBCOM (Los Angeles)
Switzernet
2008-09-25
Since 2008-09-16 [ch1], [ch2] we have been experiencing problems with our four French servers hosted at OVH. The network problems occur several times a day, simultaneously on all four of our SIP servers at OVH.
1.1. CPU load versus the overall number of concurrent calls
1.2. Packet loss rate versus CPU load
3. Time to live exceeded messages
4. Traceroute during the problem
One of these network problems was observed more closely on 2008-09-24 at about 18:00, and the ping and traceroute outputs were recorded. The ping records show TTL exceeded messages several times. Such messages suggest a routing loop or a temporary loss of the route. The full 10-hour ping output is attached [txt]. The ping was started on 2008-09-24 at 11:35:23 from fr1.youroute.net (91.121.66.202) to us1.youroute.net (66.234.138.73). More printouts are provided in the section Time to live exceeded messages.
148 bytes from cr2.la2ca.ip.att.net (12.122.30.30): Time to live exceeded
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
The packet loss rate and the average RTT are computed for each 300-second interval of the 10-hour ping period [txt].
[xls]
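The per-interval figures can be reproduced from the raw ping log along the following lines. This is only a sketch under stated assumptions: the function name `interval_stats` is hypothetical, reply lines are assumed to follow the usual `64 bytes from ...: icmp_seq=N ttl=T time=X ms` layout (the report itself only shows that reply lines begin with `64`), and one echo request per second is assumed, which is consistent with 36,000 packets sent over 10 hours.

```python
import re
from collections import defaultdict

# Assumed reply-line layout of ping; not taken from the report itself.
REPLY = re.compile(r"^64 bytes from .*icmp_seq=(\d+) .*time=([\d.]+) ms")

def interval_stats(lines, interval=300, total_sent=None):
    """Return {interval_index: (loss_rate, average_rtt_ms or None)}.

    Assumes one echo request per second, so icmp_seq doubles as the
    number of seconds elapsed since the ping was started.
    """
    rtt_by_seq = {}
    for line in lines:
        m = REPLY.match(line)
        if m:  # anything else (TTL exceeded notices, header dumps) is skipped
            # setdefault keeps the first reply if a sequence number repeats
            rtt_by_seq.setdefault(int(m.group(1)), float(m.group(2)))

    # Without an explicit count, assume the last answered probe was the last sent.
    sent = total_sent if total_sent is not None else max(rtt_by_seq, default=-1) + 1

    buckets = defaultdict(list)
    for seq, rtt in rtt_by_seq.items():
        buckets[seq // interval].append(rtt)

    stats = {}
    for idx in range((sent + interval - 1) // interval):
        expected = min(interval, sent - idx * interval)  # last interval may be short
        got = buckets.get(idx, [])
        loss = 1.0 - len(got) / expected
        avg = sum(got) / len(got) if got else None
        stats[idx] = (loss, avg)
    return stats
```

Intervals whose loss rate stands out can then be compared against the timestamps of the CPU peaks.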
A traceroute was run during the 18:00 problem. However, the program was launched only a few seconds before the end of the failure. In the traceroute output below, we can observe the problem only at the 2nd and 5th hops. The problem simply disappeared by the time traceroute moved on to the next hops. The output suggests that the problem is possibly very close to our servers in France, inside the OVH network. The full traceroute output is in the section Traceroute during the problem.
traceroute to us1.youroute.net (66.234.138.73), 64 hops max, 40 byte packets
1 rbx-16-m2.routers.ovh.net (91.121.66.252) 0.526 ms 0.385 ms 0.461 ms
2 rbx-2-6k.routers.ovh.net (213.251.191.130) 312.785 ms 530.491 ms *
3 * * 160g.gsw-2-6k.routers.ovh.net (213.186.32.221) 38.391 ms
4 * * *
5 30g.gblx.gsw-1-6k.routers.ovh.net (213.186.32.129) 150.169 ms 287.610 ms 27.970 ms
6 te-4-2.car2.Paris1.level3.net (4.68.127.97) 23.790 ms 12.431 ms 12.229 ms
7 ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126) 19.745 ms
ae-31-51.ebr1.Paris1.Level3.net (4.68.109.30) 16.646 ms
ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126) 14.303 ms
...
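The misbehaving hops in the excerpt above can also be picked out mechanically. The sketch below is an illustration, not the method used for this report: `parse_traceroute` and `suspicious_hops` are hypothetical names, and the rule applied (a hop is flagged if a probe was lost, or if its slowest reply took more than an assumed 50 ms longer than the fastest reply from any later hop) is only a heuristic, since routers may answer probes slowly for unrelated reasons.

```python
import re

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")    # "<hop number> <rest of line>"
RTT_MS = re.compile(r"(\d+(?:\.\d+)?) ms\b")   # one probe round-trip time

def parse_traceroute(lines):
    """Return {hop: {"rtts": [ms, ...], "lost": count}} from traceroute output."""
    hops = {}
    current = None
    for line in lines:
        m = HOP_LINE.match(line)
        if m:
            current = int(m.group(1))
            hops[current] = {"rtts": [], "lost": 0}
            rest = m.group(2)
        elif current is None:
            continue  # banner line before the first hop
        else:
            rest = line  # continuation line: another router answered for this hop
        hops[current]["rtts"] += [float(x) for x in RTT_MS.findall(rest)]
        hops[current]["lost"] += rest.split().count("*")
    return hops

def suspicious_hops(hops, margin_ms=50.0):
    """Flag hops with lost probes or replies far slower than any later hop."""
    order = sorted(hops)
    flagged = []
    for i, hop in enumerate(order):
        rtts = hops[hop]["rtts"]
        later = [r for h in order[i + 1:] for r in hops[h]["rtts"]]
        slow = bool(rtts) and bool(later) and max(rtts) - min(later) > margin_ms
        if slow or hops[hop]["lost"] > 0:
            flagged.append(hop)
    return flagged
```

Applied to the excerpt above, this flags hops 2 to 5: hops 2 and 5 for their delays, and hops 2, 3 and 4 for lost probes. Hops 6 and 7, inside the Level3 network, are not flagged.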
Connections with the billing servers in the USA are affected during each such problem. The CPU load of the affected SIP servers jumps as they try to maintain the rapidly growing number of open and incomplete SIP transactions. The encircled peak of the CPU chart corresponds to the 18:00 failure discussed above (see also sections 1 and 2):
During the observed problems, the calls on the affected servers are dropped. The diagram below shows the matching points between the call load histogram and the CPU peaks of the SIP servers. The green/blue histogram shows the overall network load toward the Geneva interconnection point [ch1]. For each CPU peak of the OVH servers, the overall number of concurrent calls drops noticeably. The CPU load chart represents seven SIP servers, and we see that the problems involve only the four OVH servers: fr1.youroute.net (91.121.66.202), fr2.youroute.net (91.121.19.149), fr3.youroute.net (91.121.101.126), and fr4.youroute.net (91.121.75.124).
The 300-second intervals with a high packet loss rate [xls] match the CPU peaks of the four SIP servers:
OVH has been informed, but the problem has not yet been localized or confirmed. Switzernet is in the process of launching an additional server in the UK to move part of the load away from the French servers. A server in Denmark is planned if the UK operation succeeds. BBCOM will be informed in case BGP routing turns out to be the cause. A 15-hour test has been launched [txt].
On 2008-09-24 at about 18:00 we observed time to live exceeded messages while pinging our US server from the OVH server fr1.youroute.net (91.121.66.202).
The screenshot shows the TTL exceeded messages in records about 30 minutes old. Below is a printout of the same messages. The full ping output file, with 36,000 sent packets, was started on 2008-09-24 at 11:35:23 and is attached [txt].
sona@fr1$
sona@fr1$ date
Wed Sep 24 18:43:07 CEST 2008
sona@fr1$
sona@fr1$ tail -2200 080924.113523-pingfrom-fr1.youroute.net.txt | grep -v ^64
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 8ffd 0 0000 01 01 be35 91.121.66.202 66.234.138.73
148 bytes from cr2.la2ca.ip.att.net (12.122.30.30): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 90e0 0 0000 06 01 b852 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 911e 0 0000 01 01 bd14 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 92d8 0 0000 01 01 bb5a 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 9598 0 0000 01 01 b89a 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 95d9 0 0000 01 01 b859 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 95ed 0 0000 01 01 b845 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 9613 0 0000 01 01 b81f 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 9622 0 0000 01 01 b810 91.121.66.202 66.234.138.73
36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 5400 9660 0 0000 01 01 b7d2 91.121.66.202 66.234.138.73
sona@fr1$
sona@fr1$ date
Wed Sep 24 18:43:16 CEST 2008
sona@fr1$
sona@fr1$ tail -2000 080924.113523-pingfrom-fr1.youroute.net.txt | grep -v ^64
sona@fr1$ date
Wed Sep 24 18:43:41 CEST 2008
sona@fr1$
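The `grep -v ^64` filter above leaves the Time to live exceeded notices together with the quoted IP header dumps. To see at a glance which routers sent the notices, the log can be tallied per reporting router. The following is a minimal sketch; the function name `ttl_exceeded_sources` is hypothetical, and the pattern is written against the notice lines exactly as they appear in the printout above.

```python
import re
from collections import Counter

# Matches e.g. "36 bytes from <host> (<ip>): Time to live exceeded"
TTL_EXCEEDED = re.compile(
    r"^\d+ bytes from (\S+) \(([\d.]+)\): Time to live exceeded"
)

def ttl_exceeded_sources(lines):
    """Count Time to live exceeded notices per (hostname, ip) of the sender."""
    counts = Counter()
    for line in lines:
        m = TTL_EXCEEDED.match(line.strip())
        if m:  # echo replies and header dump lines do not match
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

For the printout above, this gives nine notices from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201) and one from cr2.la2ca.ip.att.net (12.122.30.30).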
The traceroute was launched just a few seconds before the 18:00 problem disappeared. The problem disappeared when traceroute started to probe the 6th hop. Before hop 6 we see problems inside the OVH network. The delay of 288 ms at the router of hop 5 and the delay of 530 ms at the router of hop 2 indicate problems in the network.
Below is the output of the same traceroute.
$ ssh sona@fr1.youroute.net
DSA host key for IP address '91.121.66.202' not in list of known hosts.
Last login: Wed Sep 24 13:58:38 2008 from 105.9.202.62.fi
Copyright (c) 1980, 1983, 1986, 1988, 1990, 1991, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.2-RELEASE (NEWKERNSMP5) #0: Wed Nov 28 17:40:48 CET 2007
server : 26060
ip : 91.121.66.202
hostname : fr1.youroute.net
To see the output from when your computer started, run dmesg(8). If it has
been replaced with other messages, look at /var/run/dmesg.boot.
-- Francisco Reyes <lists@natserv.com>
sona@fr1$ traceroute us1.youroute.net
traceroute to us1.youroute.net (66.234.138.73), 64 hops max, 40 byte packets
1 rbx-16-m2.routers.ovh.net (91.121.66.252) 0.526 ms 0.385 ms 0.461 ms
2 rbx-2-6k.routers.ovh.net (213.251.191.130) 312.785 ms 530.491 ms *
3 * * 160g.gsw-2-6k.routers.ovh.net (213.186.32.221) 38.391 ms
4 * * *
5 30g.gblx.gsw-1-6k.routers.ovh.net (213.186.32.129) 150.169 ms 287.610 ms 27.970 ms
6 te-4-2.car2.Paris1.level3.net (4.68.127.97) 23.790 ms 12.431 ms 12.229 ms
7 ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126) 19.745 ms
ae-31-51.ebr1.Paris1.Level3.net (4.68.109.30) 16.646 ms
ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126) 14.303 ms
8 ae-41.ebr2.Washington1.Level3.net (4.69.137.50) 88.374 ms
ae-1-100.ebr2.Paris1.Level3.net (4.69.133.82) 19.702 ms
ae-41.ebr2.Washington1.Level3.net (4.69.137.50) 89.489 ms
9 ae-92-92.csw4.Washington1.Level3.net (4.69.134.158) 89.240 ms
ae-41.ebr2.Washington1.Level3.net (4.69.137.50) 89.450 ms 88.293 ms
10 ae-62-62.csw1.Washington1.Level3.net (4.69.134.146) 87.324 ms
ae-64-64.ebr4.Washington1.Level3.net (4.69.134.177) 99.854 ms 90.129 ms
11 ae-4.ebr3.LosAngeles1.Level3.net (4.69.132.81) 156.736 ms
ae-64-64.ebr4.Washington1.Level3.net (4.69.134.177) 95.966 ms
ae-4.ebr3.LosAngeles1.Level3.net (4.69.132.81) 156.072 ms
12 ae-63-63.csw1.LosAngeles1.Level3.net (4.69.137.34) 157.020 ms
ae-4.ebr3.LosAngeles1.Level3.net (4.69.132.81) 167.589 ms 162.387 ms
13 ae-63-63.csw1.LosAngeles1.Level3.net (4.69.137.34) 156.419 ms
ae-12-69.car2.LosAngeles1.Level3.net (4.68.20.4) 155.066 ms
ae-63-63.csw1.LosAngeles1.Level3.net (4.69.137.34) 163.664 ms
14 BACKBONE-CO.car2.LosAngeles1.Level3.net (4.71.142.82) 155.506 ms 155.552 ms
ae-12-69.car2.LosAngeles1.Level3.net (4.68.20.4) 155.892 ms
15 BACKBONE-CO.car2.LosAngeles1.Level3.net (4.71.142.82) 155.393 ms
bvi01-ar02-1w-lax.bb2.net (66.234.135.51) 157.945 ms 156.298 ms
16 bvi01-ar02-1w-lax.bb2.net (66.234.135.51) 156.758 ms
switzernet-lax-cust.bb2.net (66.234.129.206) 155.505 ms
bvi01-ar02-1w-lax.bb2.net (66.234.135.51) 156.716 ms
17 switzernet-lax-cust.bb2.net (66.234.129.206) 155.755 ms 155.670 ms 155.850 ms
18 porta-sip. (66.234.138.73) 155.709 ms 155.441 ms 155.868 ms
sona@fr1$
sona@fr1$