Monitoring Analysis

André Guimarães, 2012-04-25

Switzernet

1.      Introduction.. 1

2.      The scripts. 1

3.      Comparison.. 2

4.      Missing values. 6

5.      Causes. 8

6.      Resources. 8

 

 

Introduction

This document compares the differences between the amount of registered customers obtained directly from a DBA and from DB2 or PBS. The data in the former 2 is replicated from the Master which has their values inserted by each DBA.

 

A new script was created in the server monitoring.switzernet.com that reads directly from one of the DBAs so we can compare the values and detect possible propagation issues.

 

The scripts

 

Both scripts are executed at the same time by CRON, but the script that reads from DBA finishes always some seconds sooner that the other.

 

The original script reads from location table in DB2.  Its code (without authentications) can be accessed here: [Download]

 

The query used to search data is:

select

  count(l1.username) as number,l1.domain as domain

from location l1,

  (select username,MAX(last_modified) as last_modified from location where last_modified > $DATE group by username ) as l2

where

  l1.username=l2.username and

  l1.last_modified=l2.last_modified

group by

  domain;

 

where DATE is calculated from the current server time – 23 and then converted to Los Angeles time. This query obtains the data of all servers, Astrad and Porta-SIP.

 

The new script reads from location table in DB3. Its code (without authentications) can be accessed here: [Download]

 

The query used to search data is:

select

  count(username) as number,domain

from (

    select * from (

      select domain, username from location2 order by domain

    ) a group by username

  ) b

group

  by domain;

 

This query sorts all users by domain and then groups the users and then domains to guarantee that we have only one registered account per user and that the first user’s account is all in the same server. With this query we obtain only customer’s registered in Astrads. No filters are used to remove expired/unregistered accounts. That is done automatically by MySQL events that run in both Astrad and DBAs.

 

Comparison

 

The following query can be used to compare values in both tables:

select

  a.date,a.users as usersdb2,b.users as usersdb3,a.users-b.users as diff

from (

  select

    sum(cus) as users,concat(left(stat_time,15),'0') as date

  from

    customer_by_host

  where

    host like 'astrad%'

  group by

    left(stat_time,15)

  ) a, (

  select

    sum(cus) as users,concat(left(stat_time,15),'0') as date

  from

    customer_by_host2

  where

    host like 'astrad%'

  group by

    left(stat_time,15)

  ) b

where

  a.date = b.date;

 

This query only selects Astrad server’s customers from both tables and groups them by time. It hen shows the total calculated from each source and the difference between them.

 

By using the query above, we see that normally there is usually only a slight difference between both tables not more than 10 users. This difference might be due to propagation of data as the data is sent by DBA to Master and then replicated to DB2 and PBS1, from where it is read by the original monitoring script. We can see however that at times there are large differences. These differences are marked in red in the image bellow:

 

 

These negative peaks appear in the total registrations graph as each dot is the average of the 2 values of the hour. For instance for the 5:00 dot:
(3905+3698)/2 = 3801.5 rounds up to 3802
As the values in the graphs bellow include Porta-SIP servers, we need to add the the users in FR4 (1) and US1 (21)
3802+(21+1)   = 3824

 

 

For some reason not yet investigated those negative peaks are not shown in the graph bellow that is obtained from the same table.

 

 

In the new registration graph, that was developed by Sérgio and uses exactly the same values, it appears:

 

 

In this graph it is now possible to see to what the difference is due. In the first time period that was marked in red we see that we are missing Astrad 14 and Astrad 3 values, in the second Astrad 17 is missing, in the third Astrad 9 is missing and on the last one we see that Astrad17 is also missing.

 

In all this cases there isn’t a gradual decrease of value. There is no value at all.

Missing values

 

With the following query the number of missing values for each server was obtained:

select ip, count(stat_time) from customer_by_host where stat_time > '2012-04-24 09:58:00' group by ip;

 

Which returns:

 

+-----------------+------------------+

| ip              | count(stat_time) |

+-----------------+------------------+

| 176.31.102.152  |               52 |

| 176.31.247.50   |               50 |

| 213.251.169.218 |               52 |

| 213.251.170.99  |               52 |

| 66.234.138.73   |               52 |

| 82.103.128.3    |               52 |

| 91.121.117.76   |               51 |

| 91.121.121.115  |               52 |

| 91.121.122.64   |               51 |

| 91.121.138.5    |               52 |

| 91.121.142.9    |               51 |

| 91.121.143.56   |               52 |

| 91.121.147.45   |               52 |

| 91.121.151.58   |               52 |

| 91.121.151.75   |               52 |

| 91.121.16.79    |               52 |

| 91.121.167.75   |               52 |

| 91.121.172.156  |               51 |

| 91.121.178.108  |               52 |

| 91.121.204.186  |               52 |

| 91.121.205.108  |               51 |

| 91.121.70.119   |               52 |

| 91.121.75.124   |               52 |

| 91.121.99.16    |               52 |

+-----------------+------------------+

 

All these missing values happened only after yesterday. There are no similar problems before. Bellow the previous and after values are shown:

 

| 498466 | astrad   | 91.121.122.64  |   3 | 2012-04-25 00:00:08 | NULL        |

| 498514 | astrad   | 91.121.122.64  |   3 | 2012-04-25 01:00:08 | NULL        |

| 498085 | astrad3  | 91.121.117.76  | 223 | 2012-04-24 16:00:08 | NULL        |

| 498133 | astrad3  | 91.121.117.76  | 225 | 2012-04-24 17:00:08 | NULL        |

| 498636 | astrad6  | 91.121.142.9   |  24 | 2012-04-25 03:30:07 | NULL        |

| 498684 | astrad6  | 91.121.142.9   |  24 | 2012-04-25 04:30:08 | NULL        |

| 498639 | astrad9  | 91.121.205.108 | 162 | 2012-04-25 04:00:08 | NULL        |

| 498687 | astrad9  | 91.121.205.108 | 159 | 2012-04-25 05:00:08 | NULL        |

| 498076 | astrad14 | 91.121.172.156 | 175 | 2012-04-24 16:00:08 | NULL        |

| 498124 | astrad14 | 91.121.172.156 | 176 | 2012-04-24 17:00:08 | NULL        |

| 498425 | astrad17 | 176.31.247.50  | 198 | 2012-04-24 23:30:08 | NULL        |

| 498473 | astrad17 | 176.31.247.50  | 203 | 2012-04-25 00:30:08 | NULL        |

| 498665 | astrad17 | 176.31.247.50  | 197 | 2012-04-25 04:30:08 | NULL        |

| 498713 | astrad17 | 176.31.247.50  | 200 | 2012-04-25 05:30:07 | NULL        |

 

The same query executed in the DBA table doesn’t show any missed values.

 

At the indicated periods there were no reported replication problems or peaks in monitoring.

Causes

 

For now the cause for this is unknown.

Resources

 

Registration by server graph

http://monitor.switzernet.com/110207-register/graph/stacked-vertical-bar-graph.php?data=../data/monit-bar-data1.txt&config=../data/monit-bar-config1.txt

 

Registration totals graph

http://monitor.switzernet.com/110207-register/graph/line-graph.php?data=../data/monit-line-data1.txt&config=../data/monit-line-config1.txt

 

New registration by server graph

http://monitor.switzernet.com/120222-registered/per-server.php?hours=72