Research on the cause of call interruptions

Christian Lathion, 2009-02-03

Switzernet

 

Research on the cause of call interruptions. 1

Description of the problem.. 1

Still to be done. 5

References. 5

 

 

This document describes some tests that were made to identify possible causes of call interruptions (see http://unappel.ch/people/emin-gabrielyan/public/090201-call-interrupts/ for a statistical study of the call interruptions present on our network). It focuses on the technical aspect, to find the cause of a portion of the interruptions highlighted in the statistics. One problem was identified, concerning the SIP exchanges between our proxies and OpenSER termination servers.

Description of the problem

Our SIP proxies periodically issue Re-Invite requests. These are normal INVITE packets, except that they are sent during an already established call. They can be used to modify an existing session. In this case case, these requests are used as keep-alives, to check whether a call is still active or not for billing purposes. These Re-Invites are sent at 32 seconds intervals, both to the customer’s SIP phone and to the termination OpenSER servers. This time interval matches the interval between peaks in our interruption statistics, thus making the Re-Invites good candidates for this problem.

 

We performed a test call which shows the problem. A full packet capture is available in PCAP format [pcap]. It can be opened using the Wireshark protocol analyzer, which was also used to generate the screenshots showing the relevant parts of the SIP exchange. The batch capture was made using ngrep, with the following command:

 

ngrep -pqlt -W byline -O /tmp/090203-testcall.pcap 41215500306 port 5060 or port 1812 or port 1813 or port 3306

 

It matches SIP, RADIUS and MySQL packets. Preliminary tests showed that only the SIP packets are relevant for the concerned call interruptions.

 

For the studied problem, the call can be split in the following parts (times shown are in seconds):

 

0.000  : initial INVITE received from the end-user phone

5.199  : voice channel is established

37.186 : reinvite (normal)

69.258 : reinvite (normal)

101.447: reinvite (normal)

133.544: reinvite (normal)

165.689: reinvite (no 488 response)

198.176: reinvite (not forwarded to vzb)

230.757: reinvite (not forwarded to vzb)

...

428.408: reinvite (not forwarded to vzb)

465.800: call is disconnected

 

The first portion of the call (blue) shows a normal behavior. The first Re-Invite is issued 32 seconds after the call establishment at 37.186s, which is. Re-Invites are correctly answered with a 488 (Not Acceptable Here) response (stating that the session parameters must remain unchanged). The 488 is received by the proxy.

 

The problem starts at 165.689s (green). The Re-Invite is correctly sent, and the proxy receives a 100 provisional response. But the 488 final response does not arrive to the SIP proxy, which means the INVITE transaction remains open.

 

From now on (red), the SIP proxy does not send Re-Invites to the OpenSER server anymore. It keeps waiting for a final response to the INVITE issued at 165.689s. The end-user side still receives the Re-Invites. The call is still active.

 

 

At 465.800s, the SIP proxy sends a BYE request to the end-user phone and vendor connection. The call is now terminated (interrupted). In the calls studied so far, the proxy always waits precisely 300 seconds before terminating the call. This suggests a timer expiration for the still open INVITE transaction, but needs to be confirmed with more call data.

 

 

The following table shows the timings of the four considered calls [xls] identified by their Call-ID. The call appearing on the screenshots is the leftmost. We consider timings once the call is established (column call), setup time is not relevant for the call interruptions problem. The call is established once the end-user device responds a final ACK to the initial INVITE. On the considered calls, it confirms:

 

 

 

d1292aac-6b12af55

2852750226

627379789

5f4b20fd-64ef9052

 

setup

call

reinv int

setup

call

reinv int

setup

call

reinv int

setup

call

reinv int

call initiation

0

 

 

0

 

 

0

 

 

0

 

 

call establishment

5

 

 

8

 

 

17

 

 

21

 

 

reinvite

37

32

32

40

32

32

50

33

32

53

32

32

reinvite

69

64

32

72

64

34

82

65

33

85

64

34

reinvite

101

96

32

106

98

35

115

98

32

119

98

32

reinvite

133

128

32

141

133

32

147

130

33

151

130

32

reinvite

165

160

33

173

165

32

180

163

32

183

162

33

reinvite

198

193

32

205

197

34

212

195

32

216

195

32

reinvite

230

225

32

239

231

33

244

227

33

248

227

32

reinvite

262

257

33

272

264

33

277

260

32

280

259

32

reinvite

295

290

34

305

297

32

309

292

32

312

291

33

reinvite

329

324

32

337

329

32

341

324

33

345

324

32

reinvite

361

356

32

369

361

 

374

357

 

377

356

 

reinvite

393

388

35

 

 

 

 

 

 

 

 

 

reinvite

428

423

32

 

 

 

 

 

 

 

 

 

reinvite

460

455

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

call disconnected

465

460

 

373

365

 

382

365

 

385

364

 

 

 

 

 

 

 

 

 

 

 

 

 

 

faulty reinvite to bye

 

300

 

 

301

 

 

300

 

 

300

 

[xls]

Still to be done

References

  1. Initial handwritten notes on two other calls showing the same pattern. (click to enlarge):

 

(red arrows indicate missing packets)

 

(red arrows indicate missing packets)

 

  1. SIP trace of three interrupted calls (two corresponding to handwritten notes and one additional with same pattern): [zip]

 

  1. Statistics on the global interruptions problem

 

  1. Short remarks on the protocols exchanged between our SIP proxies and billing servers, to identify/discard possible causes of interruptions (email):

 

SSH
The sip servers periodically log to the slave to synchronize (rsync over ssh) the music on hold files that the customers can upload. Not used for anything else.

MySQL
The sip servers update the register database on the master when receiving registers. Not used for anything else.

Maybe the interruptions can be related to the registers (if we suppose interruptions only happen if register information has changed, and since we set default register time to 10 minutes, most problems appear in the first reinvite following the registration refresh). I only propose this because of the 10 minutes coincidence, otherwise I don't see any relationship between register and interruptions.

RADIUS

1. The master does not initiate transactions or keep-alives, it only responds to the requests coming from the sip servers. During a call, we see accounting at beginning and end, authentication at the beginning. When doing reinvites, the sip servers do not send any radius request to the master. Radius packets usually contain the username or call-id in clear text, not always (e.g. when replying access denied, username is not present in the reply).

2. My tests did not show anything special at 10m32s.

3. Some radius requests are lost/retransmitted (graph generated with wireshark on a 15mn capture on one sip server), or take an excessive time to be processed:



But if my hypothesis above are correct, this should not cause call interruptions.

The sip servers are configured to accept 5 second timeouts, with maximum 3 retransmissions:
RAD_TIMEOUT:5
RAD_RETRIES:3

 

  1. Network statistics possibly showing a problem of excessive dropped udp packets on one of the SIP proxies (to investigate):

 

fr4# netstat -n -s -a -p udp
udp:
        4146353490 datagrams received
        0 with incomplete header
        486 with bad data length field
        30929 with bad checksum
        1716626023 with no checksum
        103563531 dropped due to no socket
        19626013 broadcast/multicast datagrams dropped due to no socket
        3086073 dropped due to full socket buffers
        0 not for hashed pcb
        4020046458 delivered
        4150107080 datagrams output

fr4# netstat -s | grep buffer
        3086073 dropped due to full socket buffers
        0 messages dropped due to full socket buffers

 

  1. Configuration option of the SIP proxies (in ser.cfg) concerning timeout on INVITE not receiving a final response:

 

modparam("tm", "fr_inv_timer", 360)

 

The value of 360 for this setting [kamailio.org] does not match our observation of 300 seconds timeout. It needs further investigations.

 

* * *