Saturday, 22 August 2020

Recovering from a replication failure in a MariaDB Master/Master replication setup

For the purposes of this post I'll assume we are working with two MariaDB servers that have been configured to perform master/master replication and one of them has failed. In this case Server01 is healthy while Server02 has stop replicating.

We need to firstly ensure that no queries are hitting Server02 / the failed server - this will typically be a case of stopping services / blocking network access to services that hit it. e.g. stopping httpd.

We'll also want to ensure replication is stopped on the failed server (Server02):

SERVER02> stop slave;

Now on Server01 / the working server issue:

SERVER01> stop slave;

SERVER01> flush tables with read lock; (This will temporarily stop it updating)

SERVER01> show master status;

We'll make a note of the above command - it should read something like:

File: mysql-bin.123456

Position 123

Binlog_Do_DB: <replicated_database>

Then on Server01 / the working server take a backup of the database:

SERVER01> mysqldump -u<username> -p --lock-tables --databases <database-name[s]> > export.sql

and on Server02 / the failed server - restore the backup:

SERVER02> mysql -u root -p < export.sql

Now on Server01 / the working server issue the following command to start processing changes again:

SERVER01> UNLOCK TABLES;

Then on Server02 / the failed server issue the following to repoint the logs (use the information above we recorded from Server01):

SERVER02> CHANGE MASTER TO master_log_file='mysql-bin.xxxxxx', master_log_pos=yy;

SERVER02> START SLAVE;

To verify we can issue:

SERVER02> show slave status \G

Now we need to do the reverse by ensuring Server01 / working server replicates from Server02 / failed server. On Server02 issue:

SERVER02> show master status \G

Record the output again.

Now on Server01 / the working server set the logs:

SERVER01> CHANGE MASTER TO master_log_file='mysql-bin.xxxxxx', master_log_pos=yy;

SERVER01> START SLAVE;

and then to verify replication issue:

SERVER01> SHOW SLAVE STATUS \G

Finally reverse anything you performed at the start to block comms with Server02 / the bad server e.g. start services, update firewall etc.