Recovering from catastrophic failure of the replication master

This applies to a replication architecture like the one detailed here Improving Replication Performance.

My situation was Master2 (which runs the blackhole engine) ran out of disk space and the current binlog was lost. After freeing up disk space on Master2 and restarting its replication it was now missing hours worth of binlog that its slaves are expecting to find. So to remedy this, I pointed the slaves directly back to Master1 to slurp up the missing binlog and then switch them back to Master2 once they were caught up.


========
Login to one of the replication slaves and make sure replication is stopped

mysqladmin stop

Review the final statements processed in the relay log

mysqlbinlog db00-relay-bin.000178 | tail -n 20
# at 51573639
#100706 18:58:13 server id 8  end_log_pos 110500816 Query	thread_id=72390897	exec_time=19	error_code=0
SET TIMESTAMP=1278442693/*!*/;
UPDATE tableA SET views=views+1 WHERE id='123'
/*!*/;
# at 51573817
#100706 18:58:13 server id 8  end_log_pos 110500876 Query	thread_id=72390897	exec_time=19	error_code=0
SET TIMESTAMP=1278442693/*!*/;
COMMIT
/*!*/;

Login to Master1 and find the same block of statements in its binlog

mysqlbinlog master-bin.00135 | less

I like to search via the timestamp in line 3 above.

# at 975963380
#100706 18:58:13 server id 8  end_log_pos 975963462     Query   thread_id=723908
97      exec_time=0     error_code=0
SET TIMESTAMP=1278442693/*!*/;
BEGIN
/*!*/;
# at 975963462
#100706 18:58:13 server id 8  end_log_pos 975963640     Query   thread_id=723908
97      exec_time=0     error_code=0
SET TIMESTAMP=1278442693/*!*/;
UPDATE tableA SET views=views+1 WHERE id='123'
/*!*/;
# at 975963640
#100706 18:58:13 server id 8  end_log_pos 975963667     Xid = 2030221535
COMMIT/*!*/;

On the slave, change it’s master to the actual master using the “end_log_pos” of the final statement

change master to master_host='10.0.1.10', \
	master_user='replicant', \
	master_password='s3cret', \
	master_log_file='mysql-bin.000135', \
	master_log_pos=975963667;
start slave;

Wait for it to catch up

show slave status\G

Stop replication from Master1

stop slave;

Again, look up its final statement

mysqlbinlog db00-relay-bin.000178 | tail -n 20
# at 52117220
#100706 19:03:41 server id 8  end_log_pos 111044278 Query	thread_id=72393308	exec_time=20	error_code=0
SET TIMESTAMP=1278443021/*!*/;
BEGIN
/*!*/;
# at 52117279
#100706 19:03:41 server id 8  end_log_pos 111044428 Query	thread_id=72393308	exec_time=20	error_code=0
SET TIMESTAMP=1278443021/*!*/;
UPDATE tableA SET views=views+1 WHERE id='5647'
/*!*/;
# at 52117429
#100706 19:03:41 server id 8  end_log_pos 111044488 Query	thread_id=72393308	exec_time=20	error_code=0
SET TIMESTAMP=1278443021/*!*/;
COMMIT
/*!*/;

This time locate its location in Master2’s binlog

mysqlbinlog blackhole-bin.000006 | less
# at 111044219
#100706 19:03:41 server id 8  end_log_pos 111044278     Query   thread_id=72393308      exec_time=20    error_code=0
SET TIMESTAMP=1278443021/*!*/;
BEGIN
/*!*/;
# at 111044278
#100706 19:03:41 server id 8  end_log_pos 111044428     Query   thread_id=72393308      exec_time=20    error_code=0
SET TIMESTAMP=1278443021/*!*/;
UPDATE tableA SET views=views+1 WHERE id='5647'
/*!*/;
# at 111044428
#100706 19:03:41 server id 8  end_log_pos 111044488     Query   thread_id=72393308      exec_time=20    error_code=0
SET TIMESTAMP=1278443021/*!*/;
COMMIT
/*!*/;

Now change the slave’s master back to Master2

change master to master_host='10.0.1.20', \
	master_user='replicant', \
	master_password='s3cret', \
	master_log_file='blackhole-bin.000006', \
	master_log_pos=111044488;
start slave;
show slave status\G

Will do a follow up post with how to verify the slave’s have up-to-date and accurate data.