Research on the cause of call interruptions (part 3)
Christian Lathion, 2009-02-06
Switzernet
Research on the cause of call interruptions (part 3)
No interruptions at 332 seconds
Confirming the bug on our testbed OpenSER server
This document describes our tests that were made to identify possible causes of call interruptions (see a statistical study of the call interruptions observed in our network [2009-02-01]). We continue the previous studies of the problem [2009-02-03] [2009-02-05], and try to address some of the open questions, confirm the current hypothesis and proposed solution for fighting call interruptions.
It was previously not clear why we do not obtain an interruption peak at 332 seconds. Since the calls are interrupted 300 seconds after the response to the first re-INVITE is lost, the first interruption peak should be located at 332 seconds. However, our statistics demonstrate that the first peak takes place at 364 seconds:
To address this question, the following call trace shows what happens if the very first re-INVITE is lost. The red arrow on the screenshot shows where the lost packet should have been (the 488 final response is sent by the OpenSER server, but not received by the SIP proxy on which the capture was taken):
Full capture: [pcap]
After the packet loss, behavior seems to be as described in our previous experiments [2009-02-03]. We would expect the call to be finally interrupted by the SIP proxy at 345 seconds (loss of the 488 packet + 300 seconds), which is 332 seconds if we subtract the call setup time of 13 seconds (see the 200 OK and ACK of the call setup INVITE). This is not the case here. We see the call continues after the predicted timer expiration:
It is only 900 seconds after the packet loss that the SIP proxy finally terminates the call by issuing a BYE request:
Two other call traces (text format) point the same behavior: [801017579.txt] [1053913142.txt]. This delayed call interruption would then explain why we don’t see any interruption peak at 332s. It also explains the abnormally high peak at 932s, which is not matching the remaining pattern of multiples of 32.1 seconds followed by a 300-second delay.
Further researches would be needed to find out if this behavior is a bug in the timer management. For the moment, we can see that the 932s interruption peak has the same source than the others. Solving the retransmission bug will solve this interruption peak as well.
For configuration tests, we use a testbed Kamailio server (Kamailio is a fork of Openser. For clarity, we use the OpenSER name for both server versions in the rest of this document). Since it has not the exact same configuration and version than the running production servers, we start by reproducing the bug on the test system. We set the re-INVITE handling the same way as on the running servers, i.e. stateless handling:
if(loose_route())
{
$var(comm)="LooseR.";
route(11);
if(method=="INVITE")
{
sl_send_reply("100","Your Re-INVITE is received");
sl_send_reply("200","OK");
exit;
}
t_relay();
exit;
}
An interrupted call on our testbed shows the same behavior as on the production servers. A 488 final reply is lost at 913s. Further re-INVITES are not sent by the SIP proxy anymore, because of the 100 provisional response which was sent from the OpenSER:
Full capture: [pcap]
300 seconds (1213.310 - 913.162 = 300.148) after the loss of the 488 response, the call is interrupted by the SIP proxy:
Now when we know that the bug can be reproduced on the testbed server, we can try the previously proposed configuration fix, allowing stateful handling of re-INVITES. This should provide retransmission of lost 488 packets. The configuration modifications are as follows:
if(loose_route())
{
if(method=="INVITE")
{
sl_send_reply("100","Your Re-INVITE is received");
t_newtran();
t_reply("488","Your Re-INVITE is ignored");
exit;
}
t_relay();
exit;
}
Since we want to catch transmission errors, we take two different captures for the same call. One is made on the SIP proxy (fr4, 91.121.75.124), the other on the testbed OpenSER server (1and1, 87.106.212.173).
In the capture of SIP proxy at 166s we see that provisional response 100 is missing [screenshot]. This error has no impact on the call, receiving this packet is not required. Though this response is treated statelessly in our current test configuration, the packet will not be retransmitted even with stateful processing. Their loss cannot even be detected as they don’t require any answer.
To see the lost packets, we must look on the OpenSER capture. Note that the timestamps between the two captures on two different machines have a shift of approx. 500 milliseconds. The first occurrence of retransmission of a 488 response occurs at 390s. The interval between the two transmissions is 391.244 - 390.819 = 425ms:
Full capture on the 1and1 OpenSER server: [pcap]
The capture on the SIP proxy of course shows only one received 488 packet:
Full capture on the SIP proxy: [pcap]
If we look at the time interval between transmission of the re-INVITE and reception of the 488 reply, we obtain 391.749 - 391.302 = 447ms. This confirms that the received packet is a retransmission (without retransmission, the same time interval is typically below 50ms).
This test will show the behavior of our SIP proxy when no reply is received for a re-INVITE (neither a provisional nor a final response). From the RFC, it should retransmit re-INVITES with a decreasing frequency, starting at 500ms and doubling the interval at each retransmission:
The INVITE transaction consists of a three-way handshake. The client transaction sends an INVITE, the server transaction sends responses, and the client transaction sends an ACK. For unreliable transports (such as UDP), the client transaction retransmits requests at an interval that starts at T1 seconds and doubles after every retransmission. T1 is an estimate of the round-trip time (RTT), and it defaults to 500 ms. Nearly all of the transaction timers described here scale with T1, and changing T1 adjusts their values. The request is not retransmitted over reliable transports. After receiving a 1xx response, any retransmissions cease altogether, and the client waits for further responses. The server transaction can send additional 1xx responses, which are not transmitted reliably by the server transaction. Eventually, the server transaction decides to send a final response. For unreliable transports, that response is retransmitted periodically, and for reliable transports, it is sent once. For each final response that is received at the client transaction, the client transaction sends an ACK, the purpose of which is to quench retransmissions of the response. [RFC3261, p.125]
The observed behavior matches the recommendations:
Full capture: [pcap]
Which gives the following time intervals:
Re-INVITE count |
Time |
Interval [ms] (approx.) |
1 |
45.392 |
|
2 |
45.890 |
500 |
3 |
46.900 |
1000 |
4 |
48.966 |
2000 |
5 |
52.974 |
4000 |
6 |
60.981 |
8000 |
7 |
77.026 |
16000 |
After 6 retransmissions, the server stops sending re-INVITES to the OpenSER. The call is disconnected 360 seconds after the last retransmission. This appears to match the fr_inv_timer settings in the server configuration.
We could also test the same scenario when the second re-INVITE is lost, since losing the first appears to have a special behavior (see section No interruptions at 332 seconds). We forget this for the moment, as presently we do not expect it to bring us additional elements for solving the actual call interruptions problem.
Modify the configuration to handle all possible cases (checks). To make things cleaner, the 100 provisional response will also be introduced in the transaction.
Include the modification on production servers and check for possible side-effects.
Research on the cause of call interruptions (part 1) [ch1] [ch2]
Research on the cause of call interruptions (part 2) [ch1] [ch2]
Statistics on the global interruptions problem [ch1] [ch2]
Statistics on all call interruptions for 2008: [ch1] [ch2]
* * *