Research on the cause of call interruptions
Christian Lathion, 2009-02-03
Switzernet
Research on the cause of call interruptions
This document describes some tests that were made to identify possible causes of call interruptions (see http://unappel.ch/people/emin-gabrielyan/public/090201-call-interrupts/ for a statistical study of the call interruptions present on our network). It focuses on the technical aspect, to find the cause of a portion of the interruptions highlighted in the statistics. One problem was identified, concerning the SIP exchanges between our proxies and OpenSER termination servers.
Our SIP proxies periodically issue Re-Invite requests. These are normal INVITE packets, except that they are sent during an already established call. They can be used to modify an existing session. In this case case, these requests are used as keep-alives, to check whether a call is still active or not for billing purposes. These Re-Invites are sent at 32 seconds intervals, both to the customer’s SIP phone and to the termination OpenSER servers. This time interval matches the interval between peaks in our interruption statistics, thus making the Re-Invites good candidates for this problem.
We performed a test call which shows the problem. A full packet capture is available in PCAP format [pcap]. It can be opened using the Wireshark protocol analyzer, which was also used to generate the screenshots showing the relevant parts of the SIP exchange. The batch capture was made using ngrep, with the following command:
ngrep -pqlt -W byline -O /tmp/090203-testcall.pcap 41215500306 port 5060 or port 1812 or port 1813 or port 3306
It matches SIP, RADIUS and MySQL packets. Preliminary tests showed that only the SIP packets are relevant for the concerned call interruptions.
For the studied problem, the call can be split in the following parts (times shown are in seconds):
0.000 : initial INVITE received from the end-user phone
5.199 : voice channel is established
37.186 : reinvite (normal)
69.258 : reinvite (normal)
101.447: reinvite (normal)
133.544: reinvite (normal)
165.689: reinvite (no 488 response)
198.176: reinvite (not forwarded to vzb)
230.757: reinvite (not forwarded to vzb)
...
428.408: reinvite (not forwarded to vzb)
465.800: call is disconnected
The first portion of the call (blue) shows a normal behavior. The first Re-Invite is issued 32 seconds after the call establishment at 37.186s, which is. Re-Invites are correctly answered with a 488 (Not Acceptable Here) response (stating that the session parameters must remain unchanged). The 488 is received by the proxy.
The problem starts at 165.689s (green). The Re-Invite is correctly sent, and the proxy receives a 100 provisional response. But the 488 final response does not arrive to the SIP proxy, which means the INVITE transaction remains open.
From now on (red), the SIP proxy does not send Re-Invites to the OpenSER server anymore. It keeps waiting for a final response to the INVITE issued at 165.689s. The end-user side still receives the Re-Invites. The call is still active.
At 465.800s, the SIP proxy sends a BYE request to the end-user phone and vendor connection. The call is now terminated (interrupted). In the calls studied so far, the proxy always waits precisely 300 seconds before terminating the call. This suggests a timer expiration for the still open INVITE transaction, but needs to be confirmed with more call data.
The following table shows the timings of the four considered calls [xls] identified by their Call-ID. The call appearing on the screenshots is the leftmost. We consider timings once the call is established (column call), setup time is not relevant for the call interruptions problem. The call is established once the end-user device responds a final ACK to the initial INVITE. On the considered calls, it confirms:
|
d1292aac-6b12af55 |
2852750226 |
627379789 |
5f4b20fd-64ef9052 |
||||||||
|
setup |
call |
reinv int |
setup |
call |
reinv int |
setup |
call |
reinv int |
setup |
call |
reinv int |
call initiation |
0 |
|
|
0 |
|
|
0 |
|
|
0 |
|
|
call establishment |
5 |
|
|
8 |
|
|
17 |
|
|
21 |
|
|
reinvite |
37 |
32 |
32 |
40 |
32 |
32 |
50 |
33 |
32 |
53 |
32 |
32 |
reinvite |
69 |
64 |
32 |
72 |
64 |
34 |
82 |
65 |
33 |
85 |
64 |
34 |
reinvite |
101 |
96 |
32 |
106 |
98 |
35 |
115 |
98 |
32 |
119 |
98 |
32 |
reinvite |
133 |
128 |
32 |
141 |
133 |
32 |
147 |
130 |
33 |
151 |
130 |
32 |
reinvite |
165 |
160 |
33 |
173 |
165 |
32 |
180 |
163 |
32 |
183 |
162 |
33 |
reinvite |
198 |
193 |
32 |
205 |
197 |
34 |
212 |
195 |
32 |
216 |
195 |
32 |
reinvite |
230 |
225 |
32 |
239 |
231 |
33 |
244 |
227 |
33 |
248 |
227 |
32 |
reinvite |
262 |
257 |
33 |
272 |
264 |
33 |
277 |
260 |
32 |
280 |
259 |
32 |
reinvite |
295 |
290 |
34 |
305 |
297 |
32 |
309 |
292 |
32 |
312 |
291 |
33 |
reinvite |
329 |
324 |
32 |
337 |
329 |
32 |
341 |
324 |
33 |
345 |
324 |
32 |
reinvite |
361 |
356 |
32 |
369 |
361 |
|
374 |
357 |
|
377 |
356 |
|
reinvite |
393 |
388 |
35 |
|
|
|
|
|
|
|
|
|
reinvite |
428 |
423 |
32 |
|
|
|
|
|
|
|
|
|
reinvite |
460 |
455 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
call disconnected |
465 |
460 |
|
373 |
365 |
|
382 |
365 |
|
385 |
364 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
faulty reinvite to bye |
|
300 |
|
|
301 |
|
|
300 |
|
|
300 |
|
[xls]
(red arrows indicate missing packets)
(red arrows indicate missing packets)
SSH
The sip servers periodically log to the slave to synchronize (rsync over ssh)
the music on hold files that the customers can upload. Not used for anything
else.
MySQL
The sip servers update the register database on the master when receiving
registers. Not used for anything else.
Maybe the interruptions can be related to the registers (if we suppose
interruptions only happen if register information has changed, and since we set
default register time to 10 minutes, most problems appear in the first reinvite
following the registration refresh). I only propose this because of the 10
minutes coincidence, otherwise I don't see any relationship between register
and interruptions.
RADIUS
1. The master does not initiate transactions or keep-alives, it only responds
to the requests coming from the sip servers. During a call, we see accounting
at beginning and end, authentication at the beginning. When doing reinvites,
the sip servers do not send any radius request to the master. Radius packets
usually contain the username or call-id in clear text, not always (e.g. when
replying access denied, username is not present in the reply).
2. My tests did not show anything special at 10m32s.
3. Some radius requests are lost/retransmitted (graph generated with wireshark
on a 15mn capture on one sip server), or take an excessive time to be processed:
But if my hypothesis above are correct, this should not cause call
interruptions.
The sip servers are configured to accept 5 second timeouts, with maximum 3
retransmissions:
RAD_TIMEOUT:5
RAD_RETRIES:3
fr4# netstat -n -s -a -p udp
udp:
4146353490 datagrams received
0 with incomplete header
486 with bad data length field
30929 with bad checksum
1716626023 with no checksum
103563531 dropped due to no
socket
19626013 broadcast/multicast
datagrams dropped due to no socket
3086073 dropped due to full
socket buffers
0 not for hashed pcb
4020046458 delivered
4150107080 datagrams output
fr4# netstat -s | grep buffer
3086073 dropped due to full
socket buffers
0 messages dropped due to full
socket buffers
modparam("tm", "fr_inv_timer", 360)
The value of 360 for this setting [kamailio.org] does not match our observation of 300 seconds timeout. It needs further investigations.
* * *