Network problems between OVH (Paris) and BBCOM (Los Angeles)

 

Switzernet

2008-09-25

 

Since 2008-09-16 [ch1], [ch2], we have been experiencing problems with our four French SIP servers hosted at OVH. The network problems occur several times a day, simultaneously on all four servers.

 

Network problems between OVH (Paris) and BBCOM (Los Angeles)

1.1.   CPU load versus the overall number of concurrent calls

1.2.   Packet loss rate versus CPU load

2.   Actions

3.   Time to live exceeded messages

4.   Traceroute during the problem

 

One of these network problems was observed more closely on 2008-09-24 at about 18:00, and the ping and traceroute outputs were recorded. The ping records show TTL exceeded messages several times. Such messages suggest a routing loop or a temporary loss of the route. The full 10-hour ping output is attached [txt]. The ping was started on 2008-09-24 at 11:35:23 from fr1.youroute.net (91.121.66.202) to us1.youroute.net (66.234.138.73). More printouts are provided in section 3, Time to live exceeded messages.

 

148 bytes from cr2.la2ca.ip.att.net (12.122.30.30): Time to live exceeded

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded
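
To illustrate why such a reply hints at a loop: every router that forwards a packet decrements its TTL, and the router that brings it to zero drops the packet and reports back to the sender. On a stable route the packet arrives long before that happens. If two routers temporarily point at each other, the packet bounces between them until the TTL runs out. The sketch below simulates this with hypothetical router names and next-hop tables; it is not a reconstruction of the OVH topology.

```python
def forward(next_hop, source, destination, ttl=64):
    """Follow next-hop entries from `source` towards `destination`.

    Returns ("delivered", hops_used) or ("ttl_exceeded", router), where
    `router` is the node that decremented the TTL to zero and would send
    the ICMP "Time to live exceeded" message.
    """
    node = next_hop[source]
    hops = 1
    while node != destination:
        ttl -= 1                      # each forwarding router decrements the TTL
        if ttl == 0:
            return ("ttl_exceeded", node)
        node = next_hop[node]
        hops += 1
    return ("delivered", hops)


# Hypothetical next-hop tables: a healthy path and a transient two-router loop.
healthy = {"src": "r1", "r1": "r2", "r2": "dst"}
looping = {"src": "r1", "r1": "r2", "r2": "r1"}

print(forward(healthy, "src", "dst"))
print(forward(looping, "src", "dst"))
```

In the looping case the destination is never reached, and one of the two routers in the loop is the one that reports the expiry.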

 

The packet loss rate and the average RTT are computed for each 300-second interval of the 10-hour ping period [txt].

[xls]
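
The per-interval figures could be reproduced from the raw ping log along the following lines. This is a minimal sketch, not the procedure used to build the spreadsheet. It assumes one echo request per second, reply lines of the form `64 bytes from ...: icmp_seq=N ttl=T time=X ms`, and sequence numbers that start at 0 and do not wrap. The function name `interval_stats` is hypothetical.

```python
import re
from collections import defaultdict

# Matches a normal echo reply, e.g.
# "64 bytes from 66.234.138.73: icmp_seq=12 ttl=50 time=155.7 ms"
REPLY = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")


def interval_stats(lines, sent, interval=300):
    """Return a list of (interval_index, loss_rate, avg_rtt_ms) tuples.

    `sent` is the total number of echo requests, assumed to be sent one
    per second with icmp_seq counting up from 0. avg_rtt_ms is None when
    no reply arrived during the interval.
    """
    replies = defaultdict(dict)       # interval index -> {seq: rtt}
    for line in lines:
        m = REPLY.search(line)
        if m:
            seq = int(m.group(1))
            # Keyed by seq so that duplicate replies are counted once.
            replies[seq // interval][seq] = float(m.group(2))

    stats = []
    for i in range((sent + interval - 1) // interval):
        expected = min(interval, sent - i * interval)
        rtts = list(replies.get(i, {}).values())
        loss = 1.0 - len(rtts) / expected
        avg = sum(rtts) / len(rtts) if rtts else None
        stats.append((i, loss, avg))
    return stats
```

Lines that are not echo replies, such as the TTL exceeded messages and the header dumps that follow them, are ignored and therefore count as lost packets.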

 

A traceroute was run during the 18:00 problem. However, the program was launched only a few seconds before the end of the failure. In the traceroute output below, we can observe the problem only between the 2nd and 5th hops. The problem simply disappeared by the time traceroute moved on to the next hops. The output suggests that the problem is possibly very close to our servers in France, inside the OVH network. The full traceroute screenshot is in section 4, Traceroute during the problem.

 

traceroute to us1.youroute.net (66.234.138.73), 64 hops max, 40 byte packets

 1  rbx-16-m2.routers.ovh.net (91.121.66.252)  0.526 ms  0.385 ms  0.461 ms

 2  rbx-2-6k.routers.ovh.net (213.251.191.130)  312.785 ms  530.491 ms *

 3  * * 160g.gsw-2-6k.routers.ovh.net (213.186.32.221)  38.391 ms

 4  * * *

 5  30g.gblx.gsw-1-6k.routers.ovh.net (213.186.32.129)  150.169 ms  287.610 ms  27.970 ms

 6  te-4-2.car2.Paris1.level3.net (4.68.127.97)  23.790 ms  12.431 ms  12.229 ms

 7  ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126)  19.745 ms

    ae-31-51.ebr1.Paris1.Level3.net (4.68.109.30)  16.646 ms

    ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126)  14.303 ms

...
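
A reading like the one above can be automated roughly as follows. The sketch parses traceroute output and lists the hops that lost a probe or showed a round-trip time above a threshold. The function names and the 100 ms threshold are assumptions chosen for illustration. A fixed threshold is only meaningful for the nearby European hops, since the transatlantic hops legitimately show higher round-trip times.

```python
import re

HOP = re.compile(r"^\s*(\d+)\s+(.*)$")      # line that starts a new hop
RTT = re.compile(r"(\d+(?:\.\d+)?)\s+ms")   # one probe's round-trip time


def parse_traceroute(text):
    """Map hop number -> {"rtts": [ms, ...], "lost": number of '*' probes}."""
    hops = {}
    current = None
    for line in text.splitlines():
        m = HOP.match(line)
        if m:
            current = int(m.group(1))
            rest = m.group(2)
            hops[current] = {"rtts": [], "lost": 0}
        elif current is not None and line.strip():
            rest = line   # continuation line: same hop, another router
        else:
            continue      # header or blank line
        hops[current]["rtts"] += [float(x) for x in RTT.findall(rest)]
        hops[current]["lost"] += rest.count("*")
    return hops


def suspect_hops(hops, rtt_threshold_ms=100.0):
    """Hop numbers with a lost probe or any RTT above the threshold."""
    return sorted(
        n for n, h in hops.items()
        if h["lost"] or any(r > rtt_threshold_ms for r in h["rtts"])
    )
```

Applied to the seven hops listed above, `suspect_hops(parse_traceroute(output))` returns hops 2 to 5, all of which are OVH routers.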

 

Connections with the billing servers in the USA are affected during each such problem. The CPU load of the affected SIP servers jumps as they try to maintain the rapidly growing number of open, incomplete SIP transactions. The peak of the CPU chart corresponding to the 18:00 failure discussed above is circled (see also sections 1.1 and 1.2):

 

[gif], [more]

 

1.1.   CPU load versus the overall number of concurrent calls

 

During the observed problems, the calls on the affected servers are dropped. The diagram below matches the call load histogram against the CPU peaks of the SIP servers. The green/blue histogram shows the overall network load toward the Geneva interconnection point [ch1]. At each CPU peak of the OVH servers, the overall number of concurrent calls drops noticeably. The CPU load chart covers seven SIP servers, and we see that the problems occur only on the four OVH servers: fr1.youroute.net (91.121.66.202), fr2.youroute.net (91.121.19.149), fr3.youroute.net (91.121.101.126), and fr4.youroute.net (91.121.75.124).

 

 

1.2.   Packet loss rate versus CPU load

 

The 300-second intervals with a high packet loss rate [xls] match the CPU peaks of the four SIP servers:

 

2.   Actions

OVH has been informed, but the problem has not yet been localized or confirmed. Switzernet is launching an additional server in the UK to move part of the load away from the French servers. A server in Denmark is planned if the UK operation succeeds. BBCOM will be informed as well, in case BGP routing is the cause. A 15-hour test has been launched [txt].

 

3.   Time to live exceeded messages

 

On 2008-09-24 at about 18:00, we observed time to live exceeded messages while pinging our US server from the OVH server fr1.youroute.net (91.121.66.202).

 

 

The screenshot shows the TTL exceeded messages in records that are about 30 minutes old. Below is a printout of the same messages. The full ping output file, covering 36’000 sent packets, was started on 2008-09-24 at 11:35:23 and is attached [txt].

 

sona@fr1$

sona@fr1$ date

Wed Sep 24 18:43:07 CEST 2008

sona@fr1$

sona@fr1$ tail -2200 080924.113523-pingfrom-fr1.youroute.net.txt | grep  -v ^64

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 8ffd   0 0000  01  01 be35 91.121.66.202  66.234.138.73

 

148 bytes from cr2.la2ca.ip.att.net (12.122.30.30): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 90e0   0 0000  06  01 b852 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 911e   0 0000  01  01 bd14 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 92d8   0 0000  01  01 bb5a 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 9598   0 0000  01  01 b89a 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 95d9   0 0000  01  01 b859 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 95ed   0 0000  01  01 b845 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 9613   0 0000  01  01 b81f 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 9622   0 0000  01  01 b810 91.121.66.202  66.234.138.73

 

36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded

Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst

 4  5  00 5400 9660   0 0000  01  01 b7d2 91.121.66.202  66.234.138.73

 

sona@fr1$

sona@fr1$ date

Wed Sep 24 18:43:16 CEST 2008

sona@fr1$

sona@fr1$ tail -2000 080924.113523-pingfrom-fr1.youroute.net.txt | grep -v ^64

sona@fr1$ date

Wed Sep 24 18:43:41 CEST 2008

sona@fr1$
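
The same filtering can be taken one step further by counting the TTL exceeded replies per reporting router. The sketch below is an illustration only: the function name `ttl_exceeded_by_router` is hypothetical, and the pattern assumes the message format shown in the printout above.

```python
import re
from collections import Counter

# e.g. "36 bytes from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201): Time to live exceeded"
TTL_EXCEEDED = re.compile(r"bytes from (\S+) \(([\d.]+)\): Time to live exceeded")


def ttl_exceeded_by_router(lines):
    """Count "Time to live exceeded" replies per (hostname, ip) pair."""
    counts = Counter()
    for line in lines:
        m = TTL_EXCEEDED.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Usage, assuming the log file from the session above is available locally:
# with open("080924.113523-pingfrom-fr1.youroute.net.txt") as f:
#     for (host, ip), n in ttl_exceeded_by_router(f).most_common():
#         print(n, host, ip)
```

For the ten messages printed above, this gives nine replies from 160g.rbx-2-6k.routers.ovh.net (213.186.32.201) and one from cr2.la2ca.ip.att.net (12.122.30.30).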

 

4.   Traceroute during the problem

 

The traceroute was launched just a few seconds before the 18:00 problem disappeared. The problem disappeared when traceroute started probing the 6th hop. Before hop 6, we see problems inside the OVH network. The 288 ms delay at the hop 5 router and the 530 ms delay at the hop 2 router indicate problems in the network.

 

 

Below is the output of the same traceroute.

 

$ ssh sona@fr1.youroute.net

DSA host key for IP address '91.121.66.202' not in list of known hosts.

Last login: Wed Sep 24 13:58:38 2008 from 105.9.202.62.fi

Copyright (c) 1980, 1983, 1986, 1988, 1990, 1991, 1993, 1994

        The Regents of the University of California.  All rights reserved.

 

FreeBSD 6.2-RELEASE (NEWKERNSMP5) #0: Wed Nov 28 17:40:48 CET 2007

 

 

server    : 26060

ip        : 91.121.66.202

hostname  : fr1.youroute.net

 

To see the output from when your computer started, run dmesg(8).  If it has

been replaced with other messages, look at /var/run/dmesg.boot.

                -- Francisco Reyes <lists@natserv.com>

sona@fr1$ traceroute us1.youroute.net

traceroute to us1.youroute.net (66.234.138.73), 64 hops max, 40 byte packets

 1  rbx-16-m2.routers.ovh.net (91.121.66.252)  0.526 ms  0.385 ms  0.461 ms

 2  rbx-2-6k.routers.ovh.net (213.251.191.130)  312.785 ms  530.491 ms *

 3  * * 160g.gsw-2-6k.routers.ovh.net (213.186.32.221)  38.391 ms

 4  * * *

 5  30g.gblx.gsw-1-6k.routers.ovh.net (213.186.32.129)  150.169 ms  287.610 ms  27.970 ms

 6  te-4-2.car2.Paris1.level3.net (4.68.127.97)  23.790 ms  12.431 ms  12.229 ms

 7  ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126)  19.745 ms

    ae-31-51.ebr1.Paris1.Level3.net (4.68.109.30)  16.646 ms

    ae-32-54.ebr2.Paris1.Level3.net (4.68.109.126)  14.303 ms

 8  ae-41.ebr2.Washington1.Level3.net (4.69.137.50)  88.374 ms

    ae-1-100.ebr2.Paris1.Level3.net (4.69.133.82)  19.702 ms

    ae-41.ebr2.Washington1.Level3.net (4.69.137.50)  89.489 ms

 9  ae-92-92.csw4.Washington1.Level3.net (4.69.134.158)  89.240 ms

    ae-41.ebr2.Washington1.Level3.net (4.69.137.50)  89.450 ms  88.293 ms

10  ae-62-62.csw1.Washington1.Level3.net (4.69.134.146)  87.324 ms

    ae-64-64.ebr4.Washington1.Level3.net (4.69.134.177)  99.854 ms  90.129 ms

11  ae-4.ebr3.LosAngeles1.Level3.net (4.69.132.81)  156.736 ms

    ae-64-64.ebr4.Washington1.Level3.net (4.69.134.177)  95.966 ms

    ae-4.ebr3.LosAngeles1.Level3.net (4.69.132.81)  156.072 ms

12  ae-63-63.csw1.LosAngeles1.Level3.net (4.69.137.34)  157.020 ms

    ae-4.ebr3.LosAngeles1.Level3.net (4.69.132.81)  167.589 ms  162.387 ms

13  ae-63-63.csw1.LosAngeles1.Level3.net (4.69.137.34)  156.419 ms

    ae-12-69.car2.LosAngeles1.Level3.net (4.68.20.4)  155.066 ms

    ae-63-63.csw1.LosAngeles1.Level3.net (4.69.137.34)  163.664 ms

14  BACKBONE-CO.car2.LosAngeles1.Level3.net (4.71.142.82)  155.506 ms  155.552 ms

    ae-12-69.car2.LosAngeles1.Level3.net (4.68.20.4)  155.892 ms

15  BACKBONE-CO.car2.LosAngeles1.Level3.net (4.71.142.82)  155.393 ms

    bvi01-ar02-1w-lax.bb2.net (66.234.135.51)  157.945 ms  156.298 ms

16  bvi01-ar02-1w-lax.bb2.net (66.234.135.51)  156.758 ms

    switzernet-lax-cust.bb2.net (66.234.129.206)  155.505 ms

    bvi01-ar02-1w-lax.bb2.net (66.234.135.51)  156.716 ms

17  switzernet-lax-cust.bb2.net (66.234.129.206)  155.755 ms  155.670 ms  155.850 ms

18  porta-sip. (66.234.138.73)  155.709 ms  155.441 ms  155.868 ms

sona@fr1$

sona@fr1$

 

*   *   *