Packet loss on our SIP proxies
Christian Lathion, 2009-02-16
Switzernet
Listening ports on the SIP proxy
Packets dropped on all SIP proxies
Research into call interruptions observed on our network [2009-02-06] showed that a portion of the interrupted calls is caused by the loss of SIP signaling packets on the network. The forced call interruption was solved by enabling retransmission at the signaling level (SIP uses UDP transport here, so it must handle packet retransmission itself). Even though that symptom is fixed, we must ensure that packet loss remains at an acceptable level on our proxies. In this document, we focus on packets that are dropped by our SIP proxies. These can be signaling packets (a very small portion of the total traffic) or RTP media packets.
First, we present the general behavior of our SIP proxies. They handle both SIP signaling and media packets (media could pass directly from one peer to the other, but that is not the case here). Signaling uses fixed UDP ports: 5060, 5061 and 5062. Media uses random ports in a range above 40000. Media sockets are not permanently open; they exist only while a call takes place. The server assigns a random port number to each call leg and transmits it to the other peer in the SDP part of the SIP packets. If RTP packets reach the server without belonging to an active call, they are dropped, since no socket listens on the port. The port numbers of active media sessions can be seen in the list of active sockets. The following command lists all active IPv4 UDP sockets. Media ports are highlighted in blue:
fr4# netstat -a -p udp -f inet
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
udp4 0 0 fr4.55428 *.*
udp4 0 0 fr4.51093 *.*
...
...
udp4 0 0 fr4.49129 *.*
udp4 188 0 fr4.49128 *.*
udp4 0 0 *.64878 *.*
udp4 0 0 localhost.domain *.*
udp4 0 0 fr4.5060 *.*
udp4 0 0 fr4.5061 *.*
udp4 0 0 fr4.5062 *.*
udp4 0 0 localhost.ntp *.*
udp4 0 0 fr4.ntp *.*
udp4 0 0 *.ntp *.*
udp4 0 0 *.syslog *.*
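As a rough illustration of the listing above, the number of sockets in the media range can be counted with a small filter. This is a minimal sketch, not part of the original tooling: the function name `count_media_sockets` and the `MEDIA_MIN` variable are hypothetical, and the threshold of 40000 is taken from the port range stated in the text. The filter looks only at the port number, so any unrelated socket bound above 40000 would also be counted.

```shell
# Count UDP sockets whose local port lies in the assumed media range.
# Reads `netstat -a -p udp -f inet` output on stdin.
# MEDIA_MIN (hypothetical name) defaults to 40000, per "a range above 40000".
count_media_sockets() {
    awk -v min="${MEDIA_MIN:-40000}" '
        $1 ~ /^udp/ {
            # Local address is field 4, formatted as host.port
            n = split($4, parts, ".")
            port = parts[n]
            # Skip symbolic ports such as "ntp" or "syslog"
            if (port ~ /^[0-9]+$/ && port + 0 > min) count++
        }
        END { print count + 0 }'
}
```

It would typically be used as `netstat -a -p udp -f inet | count_media_sockets`.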
To obtain the statistics we need (packets received and dropped), we use the following netstat syntax. It outputs statistics for IPv4 UDP packets only:
fr4# netstat -s -p udp -f inet
udp:
32932 datagrams received
0 with incomplete header
0 with bad data length field
0 with bad checksum
18826 with no checksum
2775 dropped due to no socket
1 broadcast/multicast datagram dropped due to no socket
0 dropped due to full socket buffers
0 not for hashed pcb
30156 delivered
29635 datagrams output
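To make the ratio in this output explicit, the share of unicast datagrams dropped for lack of a socket can be computed directly from the two relevant counters. The sketch below is illustrative only; the function name `udp_drop_percent` is hypothetical. Because netstat counters are cumulative, the result is an average over the whole counting period rather than a current rate.

```shell
# Print the percentage of received UDP datagrams that were dropped
# because no socket was listening. Reads `netstat -s -p udp -f inet`
# output on stdin. The regex requires a number directly before
# "dropped", which excludes the broadcast/multicast line.
udp_drop_percent() {
    awk '
        /[0-9]+ datagrams received/          { recv = $1 }
        /[0-9]+ dropped due to no socket$/   { drop = $1 }
        END {
            if (recv > 0) printf "%.2f\n", 100 * drop / recv
            else print "n/a"
        }'
}
```

With the figures shown above (2775 dropped out of 32932 received), this gives roughly 8.4%.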
We focus on unicast packets dropped due to no socket (blue), since they represent by far the majority of dropped packets. We used a simple script to collect the statistics. The script periodically invokes netstat and stores the number of received packets and of the selected dropped packets in a CSV file:
#! /usr/local/bin/bash
# Every 60 seconds, append "date,received,dropped" to a per-host file.
while true; do
  (echo "date: " `date +"%Y-%m-%d-%H:%M"`; netstat -s -p udp -f inet) |
  awk '
    /^date:/ {date=$2;}
    /[0-9]+ datagrams received/ {recv=$1;}
    /[0-9]+ dropped due to no socket$/ {drop=$1;}   # unicast drops only
    END {
      print date","recv","drop;
    }';
  sleep 60;
done >> `date +"%Y-%m-%d"`"-"`hostname`
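Since netstat reports cumulative counters, each CSV line holds running totals rather than per-minute values. One plausible way to derive a per-interval drop rate is to difference consecutive lines, as in the sketch below. This post-processing step is an assumption and is not described in the original; the function name `interval_drop_rates` is hypothetical.

```shell
# Turn cumulative "date,received,dropped" lines (stdin) into
# per-interval lines: date,delta_received,delta_dropped,drop_percent.
# Intervals with a negative delta (for example after a counter reset)
# or with no received packets are skipped.
interval_drop_rates() {
    awk -F, '
        NR > 1 {
            d_recv = $2 - prev_recv
            d_drop = $3 - prev_drop
            if (d_recv > 0 && d_drop >= 0)
                printf "%s,%d,%d,%.2f\n", $1, d_recv, d_drop, 100 * d_drop / d_recv
        }
        { prev_recv = $2; prev_drop = $3 }'
}
```

For example, `interval_drop_rates < some-collected-file` would print one line per 60-second interval, where `some-collected-file` stands for a file written by the collection script.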
Our observations first highlighted a particular case. One of the SIP proxies (fr4) showed an abnormally high rate of dropped packets: approximately 4% during the daytime and up to 65% during the night. These values cannot reflect real call traffic; no service would run with such a high packet loss. The fact that the drop rate peaks during the night (when the call volume is close to zero) indicates that a constant packet flow reaches the server and is dropped because it does not match any active call (i.e. no open socket on the proxy). The following graph shows the percentage of dropped packets for the raw data (green) and for the "scaled" data obtained by subtracting 100 dropped packets per second (purple). 100 packets/sec appears to be a good estimate of the constant flow, as it brings the curve back to "normal" values (i.e. near 0%) and eliminates the nightly peaks:
To identify the source of this constantly dropped traffic, we launched a capture on the fr4 proxy (91.121.75.124). An analysis of the collected data shows that the proxy receives a constant flow of UDP packets on port 37734, the source being one of our SS7 gateways (77.59.226.99). No socket listens on port 37734 on the SIP proxy; the packets are dropped and answered with ICMP Destination unreachable messages:
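The scaling described here can be sketched as a small filter over per-interval counts. The exact computation behind the purple curve is not given in the text, so this is only one possible reading: it assumes the constant flow is removed from both the received and the dropped counts of each interval. The names `scaled_drop_percent`, `FLOW_PPS` and `INTERVAL` are hypothetical; the defaults (100 packets/sec, 60-second intervals) come from the estimate above and from the sampling period of the collection script.

```shell
# Read per-interval "label,received,dropped" lines on stdin and print
# "label,scaled_drop_percent" after removing an assumed constant flow.
# ASSUMPTION: the flow is subtracted from both received and dropped.
scaled_drop_percent() {
    awk -F, -v flow="${FLOW_PPS:-100}" -v secs="${INTERVAL:-60}" '
        {
            base = flow * secs          # packets of constant flow per interval
            recv = $2 - base
            drop = $3 - base
            if (drop < 0) drop = 0      # never report a negative drop count
            if (recv > 0) printf "%s,%.2f\n", $1, 100 * drop / recv
        }'
}
```

As an illustration with made-up numbers, a night interval with 9000 received and 6100 dropped packets has a raw drop rate of about 68%; after removing 6000 packets of constant flow it falls to about 3%.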
This traffic is also visible on the bandwidth usage of the concerned SS7 link. Usage never reaches zero, even at night when the link is idle (here, between 2 and 6am):
On the SS7 gateway, we identified a call that was stalled in the CONNECTING state. It appears to be stuck because the gateway mishandles an incomplete number format (in this case 0041):
Manually disconnecting the call immediately terminates the stalled media session and resets the dropped packets rate on the SIP proxy to a normal state:
The configuration of the SS7 gateway will be updated to block such incomplete number patterns. We could not reproduce the problem in our tests. It may arise only under particular circumstances (e.g. heavy load on the gateway). This problem should not have any impact on calls.
Using the script presented earlier, we gathered the dropped packet rate on 3 SIP proxies (fr1, fr4 and dk1) over several days. The following graphs present the data obtained for the 3 servers over a period of two days (2009-02-14 and 2009-02-15):
Full data for the 3 servers: [xls]
These statistics do not give precise information as they stand. Some of the peaks match our peak call hours, while others do not. Nor do the peaks match the load distribution across the three servers. On the fr4 proxy, we still observe a constant flow of dropped packets during the night, which needs further investigation. Between 12:30 and 16:00 on 2009-02-14, dk1 encountered a problem similar to the one described in Constant dropped packets, which resolved by itself.
The statistics presented in this document are only a short introduction to the problem, giving a first overview of the points to analyze in more detail. For now, we can only draw general conclusions.
In a normal situation, we should not see such dropped packets. One possible cause is that one of the peers (customer or vendor) keeps sending media after the call has been terminated and the SIP proxy has closed the port. This should not happen; the port should only be closed after the last transaction of the call (BYE request and OK response) has finished.
The peaks could still be related to call interruptions. At present, the 632-second interruption peak is still there, along with another, wider interruption peak (see [2009-02-01] for details). For interrupted calls, the SIP proxy sends a BYE request on its own initiative. There can be a time interval before both the vendor and the customer have terminated the call. During this interval, they can keep sending media packets to the proxy, which has already closed its media ports and therefore drops the packets. The behavior of the SS7 gateway generating stalled media sessions must also be studied in more detail, as it could be a source of dropped packets.
Finally, statistics must be built from more combined data to give a more precise view of the problem. For the moment, it is not clear whether these dropped packets have a real impact on communications or are only invalid packets.
* * *