
MySQL Replication causing network hangs

  • 09-07-2009 02:39PM
    #1


    Heyho,

    I've come across a strange issue, maybe someone could offer some advice on directions to take.

    We have a stable working master/slave MySQL configuration with a Master database in Sandyford and several full slaves in Sandyford and Citywest. These work 24/7 with very few issues, and have done so for years.

    For a recent ongoing project, I've had to add a new slave in Paris, France. This slave is not replicating every DB - just selected ones, using lines like these in /etc/my.cnf:
    ### Only replicate DBs e and f - ignore a, b, c and d
    ### (replicate-ignore-db takes one database per line; a comma-separated
    ### list is treated as a single database name)
    replicate-ignore-db=a
    replicate-ignore-db=b
    replicate-ignore-db=c
    replicate-ignore-db=d
    replicate-wild-do-table=e.%
    replicate-wild-do-table=f.%
    
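    For reference, you can double-check which filters the slave actually loaded (a sketch of my own, not part of the original config - field names as per the MySQL docs; assumes credentials in ~/.my.cnf):

    ```
    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Replicate_(Ignore_DB|Wild_Do_Table)'
    ```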

    It works fine, and replicates without issue. But every now and again, the whole slave server seems to lose network connectivity. Through a range of tests, we've established that this only happens when the MySQL replication is started - stop the slave and the box behaves fine, never missing a beat; start the slave and it'll disappear off the network for 1-2 minutes, 5-6 times a day.
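    One way to catch the blips in a log (a sketch of my own; the host, interval and log path below are placeholders, not our real setup):

    ```shell
    # Probe a host once and report "up" or "down", so a wrapper loop can
    # timestamp outages and correlate them with when the slave was started.
    check_once() {
        if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
            echo "up"
        else
            echo "down"
        fi
    }

    # Example wrapper, run in the background on a second machine:
    #   while true; do
    #     [ "$(check_once 10.0.0.5)" = "down" ] && date '+%F %T blip' >> /var/log/blip.log
    #     sleep 5
    #   done
    ```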

    I'm a bit puzzled as I've never seen this behaviour in a MySQL slave before. There are no errors on the slave server in any of the logs, but other servers in Paris and Dublin are unable to talk to the server during these blips.

    Other possible contributing factors:
    • We run a bonded pair of nics on the server, but I see no bond or nic errors in /var/log/*
    • The servers are all running RedHat EL 4, with latest RedHat MySQL RPMs (4.1.20)
    • The two servers communicate over a VPN, but monitoring of this link by various means shows 100% uptime

    Anyone have any ideas about this? It doesn't worry me too much, since the server in France will be disconnected from Dublin for good before it goes live, and the issue only occurs while replication is running (the server ran fine for months from a fresh install before I started replication on it).

    How would you proceed with investigations?

    Dave


Comments

  • Registered Users, Registered Users 2 Posts: 545 ✭✭✭ravydavygravy


    So, I forgot to post the reason and solution.

    The RHEL4 docs for bonding had a syntax error in them, which silently broke the bonding setup - both NICs were trying to claim the same MAC address at the same time on different switches, and this caused network errors whenever any traffic load (such as replication) went across the bond. We only came to this conclusion after looking at the Cisco switch logs and noticing the strange MAC behaviour.
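    A quick check that would have caught this earlier (a sketch, assuming the standard procfs location /proc/net/bonding/bond0; the helper takes the path as an argument so it can be pointed at any file):

    ```shell
    # Report the mode the bonding driver actually loaded with.
    # With working options this prints "fault-tolerance (active-backup)";
    # if the options were silently ignored, the driver falls back to its
    # default, "load balancing (round-robin)".
    bond_mode() {
        grep '^Bonding Mode:' "$1" | cut -d: -f2- | sed 's/^ //'
    }

    # Usage on a live box:
    #   bond_mode /proc/net/bonding/bond0
    ```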

    Here's the correct syntax versus the example from the bonding docs:

    [ppc@jourdan ~]$ diff /etc/modprobe.conf /etc/modprobe.conf.20090811
    5,6c5
    < alias bond0 bonding
    < options bond0 mode=1 miimon=100
    ---
    > alias bond0 bonding mode=1 miimon=100

    The version split over two lines works as expected (failover in the event of NIC or switch failure).
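    For anyone hitting the same thing, here's the working /etc/modprobe.conf fragment in full (mode=1 is active-backup, miimon=100 polls link state every 100 ms):

    ```
    # Module options must go on their own "options" line - tacking them
    # onto the "alias" line is silently ignored.
    alias bond0 bonding
    options bond0 mode=1 miimon=100
    ```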

    Dave

