Call interruption problem in the VOIP network of Switzernet

Emin Gabrielyan

Switzernet

2009-01-31

 

For a long time, customers of Switzernet have complained that phone calls are frequently interrupted after several minutes of conversation. The exact source of the problem is still unknown, but several possible causes are being explored. All interruptions seem to follow a specific pattern.

 

1. Introduction

2. Three different interruption patterns

2.1. Interruptions at 32 seconds intervals starting from 5th minute of the call

2.2. Interruptions at 10 minutes 32 seconds and at 15 minutes 32 seconds

2.3. Wide spread interruption at 669th second

2.4. Estimation of probabilities of various types of interruptions

3. Probability of the end of call

4. Number of calls longer than a given duration

5. Data retrieval, processing, and files

6. Further work

7. References:

7.1. Previous research on interruptions

7.2. References on transaction and dialog mismatch with OpenSER server and VZB links

8. Glossary

 

1.   Introduction

 

Recently a new national interconnection has been established. Since the beginning of 2009, a significant volume of calls to Swiss landlines has been routed through the new interconnection. We analyzed the call interruption patterns for calls to Swiss landlines via the old interconnection (VZB, during November 2008) and via the new interconnection (CLT, during January 2009).

 

Interruptions are easily identified on graphs showing the distribution of calls by duration. The two graphs corresponding to the two studied interconnections differ significantly with respect to interruptions. The chart below shows two noisy curves. The red one corresponds to calls routed via the main interconnection in November 2008 (vzb), and the blue one corresponds to calls routed via the new interconnection in January 2009 (clt ss7). For visual clarity only, the blue chart is flipped vertically instead of nearly overlapping the red one. The horizontal axis represents call durations in seconds, and the vertical axis represents (in both directions) the number of calls of the given duration in the corresponding monthly CDR. Values of the blue chart grow downward. All labels show durations in seconds.

The red chart has peaks at regular intervals. An abnormally high number of calls with a very specific duration indicates a high probability of call termination at that duration. The peaks on the red graph show that calls were interrupted for reasons other than the users’ natural disconnections.
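
As a minimal sketch of how such a distribution can be built and scanned for peaks, the following Python snippet counts calls per whole-second duration and flags durations whose count stands out from their neighbours. It is not the script used for this study: the function names, the neighbourhood window, the threshold factor and the synthetic data are all illustrative assumptions.

```python
from collections import Counter
from statistics import median


def duration_histogram(durations):
    """Count calls per whole-second duration."""
    return Counter(int(d) for d in durations)


def flag_peaks(hist, window=5, factor=3.0):
    """Return durations whose count exceeds `factor` times the median
    count of the neighbouring durations within +/- `window` seconds."""
    peaks = []
    for d, n in sorted(hist.items()):
        neighbours = [hist.get(d + k, 0)
                      for k in range(-window, window + 1) if k != 0]
        baseline = median(neighbours)
        if baseline > 0 and n > factor * baseline:
            peaks.append(d)
    return peaks


# Synthetic data (assumption): 10 calls for every duration from 350 s to
# 379 s, plus 40 extra calls ending at exactly 364 s.
durations = [d for d in range(350, 380) for _ in range(10)] + [364] * 40
hist = duration_histogram(durations)
print(flag_peaks(hist))  # prints [364]
```

A median baseline is used here so that a single narrow peak does not inflate its own neighbourhood; other baselines, such as the interpolated curves mentioned later in this document, would serve the same purpose.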

 

Numerous peaks of the red graph do not exist on the blue one. Only the interruptions at the 632nd second of conversation and a wide spread peak near the 672nd second (~11m) are present on both charts. This means that the secondary (new) interconnection is probably not affected by all of the interruption problems, which lets us narrow the search for the cause. The rest of the document presents a deeper analysis of the interruption pattern of the main, faulty interconnection, which exhibits the numerous periodic peaks.

 

2.   Three different interruption patterns

 

We are possibly dealing with three different causes of interruption. This suggestion comes from the graph below, which zooms in on the November distribution of calls via the faulty interconnection. We are interested in the region between 300 seconds (5 min) and 1200 seconds (20 min). The light-orange noisy curve represents the distribution of calls retrieved from the CDR (the Y axis being the number of calls). The three other curves are different interpolations required for the computation of probabilities. Red dots mark the points of frequent interruptions and are labeled with the call duration in seconds:

 

If we set aside the two interruptions at 632 and 932 seconds (the two highest peaks relative to the base curve), all remaining narrow-band interruptions occur at multiples of 32 seconds counted from the 300th second.

2.1. Interruptions at 32 seconds intervals starting from 5th minute of the call

The interval between interruption moments is not exactly 32 seconds but about 32.07 seconds. The following table shows the durations at the interruption points of the graph. Read the numbers column by column. Successive numbers in each column are spaced by 32 seconds. One additional second is gained when shifting to the next column, meaning that the interval between possible points of interruption is slightly more than 32 seconds.

 

  -     621     910
  -     653     942
364     685     974
396     717    1006
428     749    1038
460     781    1070
492     813    1102
524     845    1134
556     877       -
588     909       -
620       -       -

 

These interruptions are possibly related to keep-alive signaling messages of the SIP servers or of the billing server (AAA messages between billing and SIP). Transmission of these presumed keep-alive messages starts at the 5th minute of the conversation. The absence of call interruptions at the 300th and 332nd seconds is possibly explained by billing cutting the call only on the 3rd consecutive failure of the keep-alive control. Interruptions at the 364th second would therefore mean that the consecutive keep-alive controls at the 300th, 332nd, and 364th seconds all failed.
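
The interval of about 32.07 seconds quoted at the start of this subsection can be cross-checked against the table with a few lines of Python. This is only a sketch: it assumes that the last value of one column and the first value of the next (620/621 and 909/910) describe the same peak, and it uses a simple endpoint estimate, which is not necessarily how the figure of 32.07 was obtained.

```python
# Table columns, in seconds (values spaced by 32 within each column).
col1 = list(range(364, 621, 32))    # 364 ... 620
col2 = list(range(621, 910, 32))    # 621 ... 909
col3 = list(range(910, 1135, 32))   # 910 ... 1134

# Assumption: 620/621 and 909/910 are the same peak listed twice,
# so the table holds 25 distinct peaks and 24 intervals.
distinct_peaks = len(col1) + len(col2) + len(col3) - 2
intervals = distinct_peaks - 1

period = (col3[-1] - col1[0]) / intervals
print(f"{distinct_peaks} peaks, estimated period {period:.2f} s")
# prints: 25 peaks, estimated period 32.08 s
```

The endpoint estimate of roughly 32.08 seconds is consistent with the value of about 32.07 seconds given in the text, within the one-second resolution of the durations.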

2.2. Interruptions at 10 minutes 32 seconds and at 15 minutes 32 seconds

Interruptions at the 632nd and 932nd seconds (marked by an asterisk ‘*’ on the graph) do not fit the main pattern of 32 second intervals (which begins at the 5th minute). These could result from another keep-alive control ‘accidentally’ launched at the 10th and 15th minutes of conversation.

 

If this hypothesis is true, then with the additional keep-alive controls of the 10th and 15th minutes the calls are cut on the second failure (and not on the third), because the interruptions occur at the 632nd and 932nd seconds and not at the 664th and 964th seconds respectively.
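
The arithmetic behind both keep-alive hypotheses can be sketched in Python as follows. The function name and its parameters are hypothetical and only model the reasoning in the text (a check every 32 seconds, with the call cut on the n-th consecutive failure); they do not describe the actual behaviour of the billing or SIP servers.

```python
def first_cut_second(start, interval, failures_to_cut):
    """Second at which a call is cut if every keep-alive check fails,
    with checks at start, start + interval, start + 2 * interval, ..."""
    return start + (failures_to_cut - 1) * interval


# Keep-alive assumed to start at the 5th minute, cut on the 3rd failure.
print(first_cut_second(300, 32, 3))   # 364

# Keep-alive assumed to start at the 10th and 15th minutes,
# cut on the 2nd failure.
print(first_cut_second(600, 32, 2))   # 632
print(first_cut_second(900, 32, 2))   # 932

# A cut on the 3rd failure would instead give 664 and 964.
print(first_cut_second(600, 32, 3), first_cut_second(900, 32, 3))
```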

 

A serious question and concern is why a new keep-alive control starts at all at the 5th, 10th, and 15th minutes of conversation, given that a keep-alive control is already launched at the beginning of the call. The two types of interruptions, (a) those occurring at periodic durations of 364, 396 … seconds and (b) those at the 632nd and 932nd seconds of conversation, may share the same cause: a bug that unexpectedly launches an extra keep-alive control.

2.3. Wide spread interruption at 669th second

There is yet another region of interruptions, with a peak at the 669th second of conversation (~11m). These interruptions do not occur at an exact duration and do not form a narrow-band peak; they are spread over a 30 second wide region. There is possibly another 30 second wide region of spread interruptions around the 625th second. We do not have any hypothesis concerning this type of interruption.

 

2.4. Estimation of probabilities of various types of interruptions

We estimate that the first two types of interruptions (possibly related to keep-alive controls) may occur during a call with a probability of 5.08%. The other type, the wide spread interruptions, may occur during a call with a probability of about 8.26%.

 

3.   Probability of the end of call

An even clearer separation of the interruption moments can be achieved by displaying the probability of interruption as a function of the call duration. On a 2 hour scale the chart (with the statistics of November) looks as follows:

 

The vertical axis represents the probability that a call is interrupted during the given second. Zooming in on the region from 5 to 19 minutes, we clearly identify all types of interruption peaks discussed in the previous section.
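
One plausible way to obtain such a per-second probability from the distribution of calls by duration is to divide the number of calls ending at a given second by the number of calls still active at that second. The sketch below assumes this definition, which the text does not spell out; the function name and the toy data are illustrative.

```python
def end_of_call_probability(hist):
    """Map each duration d to the fraction of calls ending at d
    among the calls that lasted at least d seconds."""
    still_active = sum(hist.values())
    prob = {}
    for d in sorted(hist):
        prob[d] = hist[d] / still_active
        still_active -= hist[d]
    return prob


# Toy distribution (assumption): duration in seconds -> number of calls.
hist = {10: 50, 20: 30, 30: 20}
prob = end_of_call_probability(hist)
print(prob)  # prints {10: 0.5, 20: 0.6, 30: 1.0}
```

With this definition a peak in the probability chart is not diluted by the overall decline of the number of long calls, which may be why the peaks separate more clearly than on the raw distribution.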

 

The scale of the chart is set to show labels and vertical gridlines every 32.07 seconds (starting from the 300th second). The periodic interruptions now match the grid almost exactly, without the accumulated gain of 2 seconds when approaching the 20th minute of the conversation.
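
The effect of the grid spacing can be checked numerically against the peak durations listed in section 2.1. The sketch below is an illustration under assumptions: where that table lists one peak under two values (620/621 and 909/910) the earlier value is taken, and the peak at 364 seconds is treated as grid point number 2 counted from the 300th second.

```python
# One value per peak, in seconds: 364 ... 620, 653 ... 909, 942 ... 1134.
observed = (list(range(364, 621, 32))
            + list(range(653, 910, 32))
            + list(range(942, 1135, 32)))


def max_deviation(peaks, start, interval, first_index):
    """Largest distance in seconds between the peaks and the grid points
    start + k * interval for k = first_index, first_index + 1, ..."""
    return max(abs(t - (start + (first_index + i) * interval))
               for i, t in enumerate(peaks))


print(max_deviation(observed, 300, 32, 2))                # 2
print(round(max_deviation(observed, 300, 32.07, 2), 2))   # about 0.7
```

On a 32 second grid the peaks drift by up to 2 seconds, while on a 32.07 second grid every peak stays within one second of a gridline, which is the resolution of the recorded durations.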

 

The average probability of interruption at the critical durations is equal to 0.42%, while the probability of a natural end of call during the entire period is equal to 0.15%.

 

Assuming that the interruptions are caused simply by network losses of keep-alive packets, and therefore considering three consecutive losses of keep-alive packets, the packet loss rate can be computed as follows:

 

The result of 14.07% is too high to be true. This hypothesis should also be ruled out given the observations on the new interconnection link, where the periodic interruptions are not present at all.
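
The order of magnitude of this result can be reproduced with a rough sketch. It assumes, as one possible reading of the reasoning above, that the excess of the interruption probability over the natural end-of-call probability equals the probability of three independent consecutive packet losses. The variable names are illustrative and the inputs are the rounded figures quoted in the text, so the output is not expected to match 14.07% exactly.

```python
p_critical = 0.0042    # average probability of interruption at critical durations
p_natural = 0.0015     # probability of a natural end of call
failures_to_cut = 3    # consecutive keep-alive losses assumed to cut the call

# Assumption: excess probability = loss_rate ** failures_to_cut
excess = p_critical - p_natural
loss_rate = excess ** (1 / failures_to_cut)
print(f"{loss_rate:.1%}")  # prints 13.9%
```

A loss rate of this size on every keep-alive exchange would imply a network far too degraded to carry voice, which supports rejecting the hypothesis.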

 

A seriously suspected cause is a mismatch of transactions or of dialogs by the OpenSER server running on the VZB links (see section 7.2, References on transaction and dialog mismatch with OpenSER server and VZB links).

 

Another possible cause is a buggy launch of a 2nd keep-alive control at the 5th minute of the conversation. A question to answer is why this buggy 2nd keep-alive does not cause problems with the new interconnection.

4.   Number of calls longer than a given duration

It may be easier to think in terms of calls lasting longer than a given duration. In such statistics, the number of lasting calls should drop significantly at points of high interruption probability. The interruption points are not visually identifiable on the chart of the lasting calls.

 

However, the phenomenon can be identified with a closer look. The diagram below shows the chart for durations ranging from 5 to 15 minutes. At four of the discussed interruption points the curve is zoomed in, showing the significance of the drop in the number of lasting calls at these points.
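
The number of lasting calls can be derived from the distribution of calls by duration with a cumulative sum taken from the longest duration downward. The following Python sketch shows the idea on toy data; the function name and the numbers are illustrative assumptions, not values from the CDR.

```python
def calls_lasting_at_least(hist):
    """Map each duration d to the number of calls with duration >= d."""
    lasting = {}
    running = 0
    for d in sorted(hist, reverse=True):
        running += hist[d]
        lasting[d] = running
    return lasting


# Toy distribution (assumption) with an interruption peak at 364 s.
hist = {363: 10, 364: 40, 365: 10}
lasting = calls_lasting_at_least(hist)
for d in sorted(lasting):
    print(d, lasting[d])
# prints:
# 363 60
# 364 50
# 365 10
```

In this toy case the curve loses 10 calls between 363 and 364 seconds but 40 calls between 364 and 365 seconds; a step of this kind is what the zoomed-in regions of the chart reveal.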

 

5.   Data retrieval, processing, and files

 

CDR download instructions [email]

 

CDR download log [log.txt]

 

Call duration statistics retrieval script [sh.txt]

 

Comparison of two links [xls]

 

Analysis of the faulty link [xls]

 

6.   Further work

 

Computation of the interruption probability as a function of time of day

 

Finding the fraction of affected phone numbers (answer the question whether all numbers are affected or only a specific subset)

 

Identify the SIP server and the types of devices involved, in case only a specific set of phone numbers is affected

 

Retrieve statistics after reconfiguration of re-INVITE replies (transmit error instead of ok)

 

Find a way to reconfigure billing to wait for 4 retransmissions before cutting the call, and observe the statistics

 

7.   References:

 

Call interruption problem in the network of Switzernet, study of durations and probabilities (this document):

http://unappel.ch/people/emin-gabrielyan/public/090201-call-interrupts/

http://4z.com/people/emin-gabrielyan/public/090201-call-interrupts/

http://switzernet.com/people/emin-gabrielyan/090201-call-interrupts/

 

7.1. Previous research on interruptions

 

Identifying the interruptions in call statistics using a statistical simulation:

http://switzernet.com/public/081118-call-interruptions/

http://unappel.ch/public/081118-call-interruptions/

 

Identifying abnormal interruptions at 10m32s and at ~11m on real traffic:

http://switzernet.com/public/081119-interuption-d-appel/

http://unappel.ch/public/081119-interuption-d-appel/

 

Identifying interruptions on an analytically simulated model:

http://unappel.ch/people/emin-gabrielyan/public/081120-call-interruptions/

 

An attempt to find a connection between interruptions and different factors such as device type, SIP server, user’s ISP:

http://switzernet.com/public/081223-call-interruptions-statistics/

http://switzernet.com/company/081223-call-interruptions-statistics/

 

Procedure of refunds to affected customers:

http://switzernet.com/company/081124-refund-interruption-calls/

 

7.2. References on transaction and dialog mismatch with OpenSER server and VZB links

 

SIP related docs:

http://switzernet.com/people/emin-gabrielyan/070403-sip-invite-cancel/

http://switzernet.com/people/emin-gabrielyan/070410-SIP-transactions/

http://switzernet.com/people/emin-gabrielyan/070412-SIP-record-route/

http://switzernet.com/people/emin-gabrielyan/070424-sip-authentication/

http://switzernet.com/people/emin-gabrielyan/070430-sip-transactions-fr/

http://switzernet.com/people/emin-gabrielyan/070501-sip-docs/

http://switzernet.com/people/emin-gabrielyan/070528-perl-primisip/

http://switzernet.com/people/emin-gabrielyan/070605-perl-primisip/

http://switzernet.com/people/emin-gabrielyan/070605-primisip/

 

OpenSER related docs:

http://switzernet.com/people/emin-gabrielyan/070413-openser-transactions/

http://switzernet.com/people/emin-gabrielyan/070416-openser-loops/

http://switzernet.com/people/emin-gabrielyan/070523-openser-repeating-ack/

http://switzernet.com/people/emin-gabrielyan/070524-openser-invite-100-handling/

http://switzernet.com/people/emin-gabrielyan/070612-openser-crontab-restart/

http://switzernet.com/people/emin-gabrielyan/070615-openser-unmatched-replies/

 

Verizon related problems:

http://switzernet.com/people/emin-gabrielyan/070523-verizon-portasip-reinvite/

http://switzernet.com/people/emin-gabrielyan/070523-verizon-unrecognized-ack/

http://switzernet.com/people/emin-gabrielyan/070525-verizon-repeated-487-status/

http://switzernet.com/people/emin-gabrielyan/070525-verizon-unprocessed-ack/

http://switzernet.com/people/emin-gabrielyan/070528-verizon-unprocessed-ack/

 

http://4z.com/People/emin-gabrielyan/public/070619-rroute-fix-replsl/

http://switzernet.com/people/emin-gabrielyan/070619-rroute-fix-replsl/

 

8.   Glossary

 

AAA stands for “authentication, authorization and accounting” [wiki]

 

SIP stands for “Session Initiation Protocol” [rfc3261]

 

VOIP stands for “Voice over IP”

 

CDR stands for “Call Data Records”

 

ISP stands for “Internet Service Provider”

*   *   *

Copyright © 2009 Switzernet