André Guimarães, 2012-04-25
Switzernet
1.
Introduction
2.
The scripts
3.
Comparison
5.
Causes
6.
Resources
This document compares the differences between the amount of registered customers obtained directly from a DBA and from DB2 or PBS. The data in the former 2 is replicated from the Master which has their values inserted by each DBA.
A new script was created in the server monitoring.switzernet.com that reads directly from one of the DBAs so we can compare the values and detect possible propagation issues.
Both scripts are executed at the same time by CRON, but the script that reads from DBA finishes always some seconds sooner that the other.
The original script reads from location table in DB2. Its code (without authentications) can be accessed here: [Download]
The query used to search data is:
select
count(l1.username) as number,l1.domain as domain
from location l1,
(select username,MAX(last_modified) as last_modified from location where last_modified > $DATE group by username ) as l2
where
l1.username=l2.username and
l1.last_modified=l2.last_modified
group by
domain;
where DATE is calculated from the current server time – 23 and then converted
to
The new script reads from location table in DB3. Its code (without authentications) can be accessed here: [Download]
The query used to search data is:
select
count(username) as number,domain
from (
select * from (
select domain, username from location2 order by domain
) a group by username
) b
group
by domain;
This query sorts all users by domain and then groups the users and then domains to guarantee that we have only one registered account per user and that the first user’s account is all in the same server. With this query we obtain only customer’s registered in Astrads. No filters are used to remove expired/unregistered accounts. That is done automatically by MySQL events that run in both Astrad and DBAs.
The following query can be used to compare values in both tables:
select
a.date,a.users as usersdb2,b.users as usersdb3,a.users-b.users as diff
from (
select
sum(cus) as users,concat(left(stat_time,15),'0') as date
from
customer_by_host
where
host like 'astrad%'
group by
left(stat_time,15)
) a, (
select
sum(cus) as users,concat(left(stat_time,15),'0') as date
from
customer_by_host2
where
host like 'astrad%'
group by
left(stat_time,15)
) b
where
a.date = b.date;
This query only selects Astrad server’s customers from both tables and groups them by time. It hen shows the total calculated from each source and the difference between them.
By using the query above, we see that normally there is usually only a slight difference between both tables not more than 10 users. This difference might be due to propagation of data as the data is sent by DBA to Master and then replicated to DB2 and PBS1, from where it is read by the original monitoring script. We can see however that at times there are large differences. These differences are marked in red in the image bellow:
These negative peaks appear in the total registrations
graph as each dot is the average of the 2 values of the hour. For instance for
the 5:00 dot:
(3905+3698)/2 = 3801.5 rounds up
to 3802
As the values in the graphs bellow include Porta-SIP
servers, we need to add the the users in FR4 (1) and
US1 (21)
3802+(21+1) = 3824
For some reason not yet investigated those negative peaks are not shown in the graph bellow that is obtained from the same table.
In the new registration graph, that was developed by Sérgio and uses exactly the same values, it appears:
In this graph it is now possible to see to what the difference is due. In the first time period that was marked in red we see that we are missing Astrad 14 and Astrad 3 values, in the second Astrad 17 is missing, in the third Astrad 9 is missing and on the last one we see that Astrad17 is also missing.
In all this cases there isn’t a gradual decrease of value. There is no value at all.
With the following query the number of missing values for each server was obtained:
select ip, count(stat_time) from customer_by_host where stat_time > '2012-04-24 09:58:00' group by ip;
Which returns:
+-----------------+------------------+
| ip | count(stat_time) |
+-----------------+------------------+
| 176.31.102.152 | 52 |
| 176.31.247.50 | 50 |
| 213.251.169.218 | 52 |
| 213.251.170.99 | 52 |
| 66.234.138.73 | 52 |
| 82.103.128.3 | 52 |
| 91.121.117.76 | 51 |
| 91.121.121.115 | 52 |
| 91.121.122.64 | 51 |
| 91.121.138.5 | 52 |
| 91.121.142.9 | 51 |
| 91.121.143.56 | 52 |
| 91.121.147.45 | 52 |
| 91.121.151.58 | 52 |
| 91.121.151.75 | 52 |
| 91.121.16.79 | 52 |
| 91.121.167.75 | 52 |
| 91.121.172.156 | 51 |
| 91.121.178.108 | 52 |
| 91.121.204.186 | 52 |
| 91.121.205.108 | 51 |
| 91.121.70.119 | 52 |
| 91.121.75.124 | 52 |
| 91.121.99.16 | 52 |
+-----------------+------------------+
All these missing values happened only after yesterday. There are no similar problems before. Bellow the previous and after values are shown:
| 498466 | astrad | 91.121.122.64 | 3 | 2012-04-25 00:00:08 | NULL |
| 498514 | astrad | 91.121.122.64 | 3 | 2012-04-25 01:00:08 | NULL |
| 498085 | astrad3 | 91.121.117.76 | 223 | 2012-04-24 16:00:08 | NULL |
| 498133 | astrad3 | 91.121.117.76 | 225 | 2012-04-24 17:00:08 | NULL |
| 498636 | astrad6 | 91.121.142.9 | 24 | 2012-04-25 03:30:07 | NULL |
| 498684 | astrad6 | 91.121.142.9 | 24 | 2012-04-25 04:30:08 | NULL |
| 498639 | astrad9 | 91.121.205.108 | 162 | 2012-04-25 04:00:08 | NULL |
| 498687 | astrad9 | 91.121.205.108 | 159 | 2012-04-25 05:00:08 | NULL |
| 498076 | astrad14 | 91.121.172.156 | 175 | 2012-04-24 16:00:08 | NULL |
| 498124 | astrad14 | 91.121.172.156 | 176 | 2012-04-24 17:00:08 | NULL |
| 498425 | astrad17 | 176.31.247.50 | 198 | 2012-04-24 23:30:08 | NULL |
| 498473 | astrad17 | 176.31.247.50 | 203 | 2012-04-25 00:30:08 | NULL |
| 498665 | astrad17 | 176.31.247.50 | 197 | 2012-04-25 04:30:08 | NULL |
| 498713 | astrad17 | 176.31.247.50 | 200 | 2012-04-25 05:30:07 | NULL |
The same query executed in the DBA table doesn’t show any missed values.
At the indicated periods there were no reported replication problems or peaks in monitoring.
For now the cause for this is unknown.
Registration by server graph
Registration totals graph
New registration by server graph
http://monitor.switzernet.com/120222-registered/per-server.php?hours=72