Research on the cause of call interruptions (part 2)

Christian Lathion, 2009-02-05

Switzernet

Research on the cause of call interruptions (part 2)

Handling of Re-Invites on another vendor interconnection

This document describes the tests we carried out to identify possible causes of call interruptions (see http://unappel.ch/people/emin-gabrielyan/public/090201-call-interrupts/ for a statistical study of the call interruptions observed on our network). It follows a previous study of the problem [link], addresses some of the open questions, and tries to confirm the current hypothesis about the cause of the interruptions.

We now have a clearer hypothesis about the call interruption problem: because of a configuration bug, our OpenSER servers do not retransmit final responses to re-INVITEs. If such a response is lost in transit, it is never retransmitted to the SIP proxy, which then terminates the call.

Interrupted call duration

According to the configuration of our SIP proxies, the affected calls should be interrupted after 360 seconds. This does not match the observed samples, where the interruption occurs 300 seconds after the first lost 488 packet.

One hypothesis for this difference is as follows. As the following packet capture of a re-INVITE shows, the expiration time is set to 300 seconds:

U 2009/02/03 10:42:58.717810 91.121.75.124:5061 -> 212.249.15.3:5060
INVITE sip:212.249.15.3;r2=on;lr=on;ftag=c24da073a68650565bdc02cf869ce35c SIP/2.0.
Via: SIP/2.0/UDP 91.121.75.124:5061;branch=z9hG4bK9b688842ada97ebf1ac26e1c284e078e;rport.
Route: <sip:195.129.125.73;r2=on;lr=on;ftag=c24da073a68650565bdc02cf869ce35c>.
Route: <sip:212.190.89.137;lr;ftag=c24da073a68650565bdc02cf869ce35c>.
Route: <sip:+41216912818@146.188.127.2:5060>.
Max-Forwards: 70.
From: 41215500306 <sip:+41215500306@91.121.75.124>;tag=c24da073a68650565bdc02cf869ce35c.
To: <sip:+41216912818@212.249.15.3>;tag=A94D4B38-24E1.
Call-ID: d1292aac-6b12af55@192.168.1.172.
CSeq: 201 INVITE.
Contact: Anonymous <sip:91.121.75.124:5061>.
Expires: 300.
User-Agent: Sippy.
cisco-GUID: 1701763649-1195967461-2174877437-385505832.
h323-conf-id: 1701763649-1195967461-2174877437-385505832.
Content-Length: 447.
Content-Type: application/sdp.

Since our SIP proxy acts as a back-to-back user agent, we assume that it behaves as a client (UAC) towards the OpenSER termination server. If that is the case, the behavior of our proxy matches RFC 3261, except that it issues a BYE instead of a CANCEL:

The UAC MAY add an Expires header field (Section 20.19) to limit the validity of the invitation. If the time indicated in the Expires header field is reached and no final answer for the INVITE has been received, the UAC core SHOULD generate a CANCEL request for the INVITE, as per Section 9. [RFC3261, p.79]

The 300-second expiration timer thus fires before the 360-second timer, interrupting the call sooner.
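The interaction between the two timers can be illustrated with a minimal Python sketch. It assumes that the trailing dot on each capture line stands for the line ending printed by the capture tool, and that the call is cut by whichever timer fires first. The function names and the PROXY_TIMEOUT_S constant are hypothetical and do not belong to any Sippy or OpenSER API.

```python
PROXY_TIMEOUT_S = 360  # timeout expected from our SIP proxy configuration (value from the text)

def parse_expires(capture_lines):
    """Return the Expires value in seconds from capture-style header lines, or None."""
    for line in capture_lines:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "expires":
            # Assumption: the trailing "." is how the capture tool prints the line ending.
            return int(value.strip().rstrip("."))
    return None

def effective_timeout(capture_lines, proxy_timeout=PROXY_TIMEOUT_S):
    """Return the delay of the timer that fires first (simplified model)."""
    expires = parse_expires(capture_lines)
    return proxy_timeout if expires is None else min(expires, proxy_timeout)

headers = ["CSeq: 201 INVITE.", "Expires: 300.", "User-Agent: Sippy."]
print(effective_timeout(headers))
```

With the headers of the capture above, the sketch selects the 300-second Expires value rather than the 360-second proxy timeout, which is consistent with the observed interruption delay.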

OpenSER configuration

The correct handling of re-INVITEs on our SIP proxies would be to relay the packet directly to the vendor side. On the interconnection that suffers from call interruptions, a bug prevents us from doing so: we never receive a final reply from the vendor side, which causes calls to be interrupted after a very short delay. To work around this problem, we had to use the following OpenSER configuration, which handles re-INVITEs statelessly:

if(loose_route())
{
    $var(comm)="LooseR.";
    route(11);
    if(method=="INVITE")
    {
        # Reply to every re-INVITE locally and statelessly;
        # the request is not relayed to the vendor.
        sl_send_reply("100","Your Re-INVITE is received");
        sl_send_reply("200","OK");
        exit;
    }
    t_relay();
    exit;
}

A first modification was to change the final response from 200 (OK) to 488 (Not Acceptable Here). Since we send the same reply to every re-INVITE and do not apply the requested session change, replying with an error is more RFC-compliant:

If a UA receives a non-2xx final response to a re-INVITE, the session parameters MUST remain unchanged, as if no re-INVITE had been issued. Note that, as stated in Section 12.2.1.2, if the non-2xx final response is a 481 (Call/Transaction Does Not Exist), or a 408 (Request Timeout), or no response at all is received for the re-INVITE (that is, a timeout is returned by the INVITE client transaction), the UAC will terminate the dialog. [RFC3261, p.87]

The re-INVITE part of our configuration became:

if(loose_route())
{
    $var(comm)="LooseR.";
    route(11);
    if(method=="INVITE")
    {
        # Still stateless, but the final reply is now an error:
        # the session parameters remain unchanged.
        sl_send_reply("100","Your Re-INVITE is received");
        sl_send_reply("488","Your Re-INVITE is ignored");
        exit;
    }
    t_relay();
    exit;
}

This modification was expected to fix the interruptions that occur when the end-user phone issues a re-INVITE because it needs to modify the session parameters. It had no noticeable effect on our statistics. It may have fixed some of the re-INVITEs issued by end users, which are not counted in the call interruption statistics, but the real problem remains: we still process re-INVITEs statelessly rather than as part of a transaction, which contradicts the SIP protocol specification:

The INVITE transaction consists of a three-way handshake. The client transaction sends an INVITE, the server transaction sends responses, and the client transaction sends an ACK. For unreliable transports (such as UDP), the client transaction retransmits requests at an interval that starts at T1 seconds and doubles after every retransmission. T1 is an estimate of the round-trip time (RTT), and it defaults to 500 ms. Nearly all of the transaction timers described here scale with T1, and changing T1 adjusts their values. The request is not retransmitted over reliable transports. After receiving a 1xx response, any retransmissions cease altogether, and the client waits for further responses. The server transaction can send additional 1xx responses, which are not transmitted reliably by the server transaction. Eventually, the server transaction decides to send a final response. For unreliable transports, that response is retransmitted periodically, and for reliable transports, it is sent once. For each final response that is received at the client transaction, the client transaction sends an ACK, the purpose of which is to quench retransmissions of the response. [RFC3261, p.125]

Two statements in this passage explain the observed behavior: request retransmissions cease once a 1xx response is received, and over unreliable transports the final response is retransmitted periodically. Since we handle re-INVITE packets statelessly on OpenSER, the final response is not retransmitted if it is lost. And since the SIP proxy has received a provisional response, it also stops retransmitting its re-INVITE, so the call is forcibly interrupted when its timer expires. We have to find a way to retransmit the final response in case of packet loss without relaying the re-INVITE to the vendor termination, which does not respond to these packets.
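To make the retransmission rules of the RFC more concrete, the following Python sketch computes when a re-INVITE and its final response would be sent over UDP if nothing is ever received in return. T1 = 500 ms comes from the passage quoted above; the 64*T1 limit and the 4-second cap T2 on the response interval are taken from the timer definitions of RFC 3261 and are not discussed in this document. The function names are illustrative.

```python
T1 = 0.5  # default round-trip estimate in seconds (RFC 3261)
T2 = 4.0  # cap on the retransmission interval of INVITE responses (RFC 3261)

def invite_request_send_times(t1=T1):
    """Send times (s) of an INVITE over UDP when no response at all arrives.

    The interval starts at T1 and doubles each time; the client transaction
    gives up after 64*T1.
    """
    times, now, interval = [], 0.0, t1
    while now < 64 * t1:
        times.append(now)
        now += interval
        interval *= 2
    return times

def final_response_send_times(t1=T1, t2=T2):
    """Send times (s) of a final response by a stateful server when no ACK arrives.

    The interval starts at T1, doubles up to T2, and the server transaction
    gives up after 64*T1.
    """
    times, now, interval = [], 0.0, t1
    while now < 64 * t1:
        times.append(now)
        now += interval
        interval = min(2 * interval, t2)
    return times

print(invite_request_send_times())
print(final_response_send_times())
```

Under these defaults a stateful server offers the final response eleven times within 32 seconds, whereas a stateless reply is sent exactly once.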

The following code is a first attempt at handling re-INVITEs statefully by creating a new transaction. We still reply with 488, but this time a reply lost on the network should be retransmitted until OpenSER receives the ACK. Our configuration becomes:

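if(loose_route())
{
    if(method=="INVITE")
    {
        sl_send_reply("100","Your Re-INVITE is received");
        # Create transaction state so that the 488 below is sent
        # statefully and is expected to be retransmitted until the ACK.
        t_newtran();
        t_reply("488","Your Re-INVITE is ignored");
        exit;
    }
    t_relay();
    exit;
}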

From the software documentation, this should lead to the expected behavior:

TM module enables stateful processing of SIP transactions. The main use of stateful logic, which is costly in terms of memory and CPU, is some services inherently need state. For example, transaction-based accounting (module acc) needs to process transaction state as opposed to individual messages, and any kinds of forking must be implemented statefully. Other use of stateful processing is it trading CPU caused by retransmission processing for memory. That makes however only sense if CPU consumption per request is huge. For example, if you want to avoid costly DNS resolution for every retransmission of a request to an unresolvable destination, use stateful mode. Then, only the initial message burdens server by DNS queries, subsequent retransmissions will be dropped and will not result in more processes blocked by DNS resolution. The price is more memory consumption and higher processing latency.

From user's perspective, the major function is t_relay(). It setup transaction state, absorb retransmissions from upstream, generate downstream retransmissions and correlate replies to requests.

In general, if TM is used, it copies clones of received SIP messages in shared memory. That costs the memory and also CPU time (memcpys, lookups, shmem locks, etc.) Note that non-TM functions operate over the received message in private memory, that means that any core operations will have no effect on statefully processed messages after creating the transactional state. For example, calling record_route after t_relay is pretty useless, as the RR is added to privately held message whereas its TM clone is being forwarded.

TM is quite big and uneasy to program--lot of mutexes, shared memory access, malloc and free, timers--you really need to be careful when you do anything. To simplify TM programming, there is the instrument of callbacks. The callback mechanisms allow programmers to register their functions to specific event. See t_hooks.h for a list of possible events.

Other things programmers may want to know is UAC--it is a very simplistic code which allows you to generate your own transactions. Particularly useful for things like NOTIFYs or IM gateways. The UAC takes care of all the transaction machinery: retransmissions , FR timeouts, forking, etc. See t_uac prototype in uac.h for more details. Who wants to see the transaction result may register for a callback. [kamailio.org]
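The expected difference between the stateless and the stateful handling of the final reply can be summarized with a small Python model. It is a simplification under stated assumptions, not a description of OpenSER internals: the client has already received the 100 and therefore no longer retransmits its re-INVITE, a stateless server sends the final reply once, and a stateful server sends it up to eleven times (the count that follows from the default RFC 3261 timers). The function name and the loss pattern are hypothetical.

```python
def reinvite_outcome(lost, stateful_server, max_server_sends=11):
    """Tell whether the final reply to a re-INVITE reaches the SIP proxy.

    lost[i] is True when the i-th transmission of the final reply is dropped;
    transmissions beyond the end of the list are assumed to get through.
    """
    attempts = max_server_sends if stateful_server else 1
    for i in range(attempts):
        dropped = lost[i] if i < len(lost) else False
        if not dropped:
            return "delivered"
    # No copy of the final reply arrived: the proxy keeps waiting
    # until its timer expires and then ends the call.
    return "interrupted"

one_loss = [True]  # only the first copy of the 488 is lost
print(reinvite_outcome(one_loss, stateful_server=False))
print(reinvite_outcome(one_loss, stateful_server=True))
```

In this model a single lost packet is enough to interrupt the call in the stateless case, while the stateful case only fails if every retransmission is lost.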

Handling of Re-Invites on another vendor interconnection

First, we must confirm why our other interconnections (SS7) are not subject to call interruptions. A call trace shows that re-INVITEs are likewise sent every 32 seconds by our SIP proxy. The termination router replies directly with a final 200 OK, without any intermediate provisional reply that would make the SIP proxy stop retransmitting and wait for the final response.

Full capture: [pcap]

The possible loss of a 200 (or 488) final reply should therefore not cause interruptions. Since the termination router sends no provisional response, the SIP proxy would retransmit its re-INVITE until it receives a final response. We also assume that the termination router would retransmit the 200 (or 488) response if no ACK is received, as the RFC requires.
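The role of the provisional response on the client side can be sketched as a reduced version of the INVITE client transaction of RFC 3261 (states Calling, Proceeding, Completed, Terminated). The event names are hypothetical labels chosen for this illustration, and the model ignores everything except request retransmission.

```python
def run_client_transaction(events):
    """Replay events against a simplified INVITE client transaction.

    Returns (state, number of request retransmissions).
    """
    state, retransmissions = "Calling", 0
    for event in events:
        if state not in ("Calling", "Proceeding"):
            break  # a final response or a timeout already ended the exchange
        if event == "retransmit_timer":
            if state == "Calling":  # requests are only retransmitted before any response
                retransmissions += 1
        elif event == "timeout":
            if state == "Calling":  # no response at all within 64*T1
                state = "Terminated"
        elif event == "1xx":
            state = "Proceeding"  # retransmissions cease, client waits
        elif event == "2xx":
            state = "Terminated"  # success, the transaction is over
        elif event == "3xx-6xx":
            state = "Completed"  # error response, acknowledged by the client
    return state, retransmissions

# No provisional reply: the re-INVITE is retransmitted until the 200 arrives.
print(run_client_transaction(["retransmit_timer", "retransmit_timer", "2xx"]))
# Provisional reply received, final reply lost: the client only waits.
print(run_client_transaction(["1xx", "retransmit_timer", "retransmit_timer"]))
```

In the second scenario the transaction stays in the Proceeding state without retransmitting, which corresponds to the situation where only the Expires timer of the proxy ends the wait.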

Further work

We will now test these hypotheses, apply the necessary configuration changes, and study their impact at large scale.

References

Research on the cause of call interruptions (part 1)

Statistics on the global interruptions problem
