Prewarm your EBS-backed EC2 MySQL slaves

This is the story of cold blocks and mismatched instances and how they will cause you pain and cost you money until you understand why.

Most of the clients that we support run on the Amazon cloud, using either RDS or MySQL on plain EC2 instances with Provisioned IOPS (PIOPS) EBS for data storage.

As expected, the common architecture is a master with one or more slaves handling the read traffic.

A common problem is that after the slaves are provisioned (normally created from an EBS snapshot) they lag badly due to slow IO performance.

Unfortunately, what tends to get lost in the “speed of provisioning new resources” fetish are some limitations of the data persistence layer (EBS).

If you are using EBS and have created the volume from a snapshot, or created a new volume, you have to pre-warm it; otherwise you will suffer a bad (I mean seriously bad) first-usage penalty. How bad? I am talking up to a 50% performance drop[1]. So that expensive PIOPS EBS volume you created is going to perform like rubbish every time it reads or writes a cold block.

The other thing which tends to happen is pairing the wrong instance type (in terms of network performance) with the PIOPS EBS volume. This is classic networked storage: the network is the bottleneck. If your instance type has limited network performance, provisioning more PIOPS than the network can handle means you are wasting money on PIOPS you can’t use. A bit like in the old days (of dedicated servers and SAN storage) where the SAN could deliver 200-300 Mbytes per sec, but the 1 Gigabit network could only do 40-50 Mbytes per sec.
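The mismatch can be put into rough numbers. The sketch below is illustrative only: the function name, the 4,000 PIOPS figure and the 16 KB I/O size are assumptions made for the example, not values for any particular instance type or volume; the 50 Mbytes/sec network figure is the one used in this post.

```python
def usable_throughput_mb_s(piops, io_size_kb, network_mb_s):
    """Return (volume_mb_s, usable_mb_s) for an EBS volume behind a network link.

    volume_mb_s is what the provisioned IOPS could deliver at the given
    I/O size; usable_mb_s is that figure capped by the instance network.
    """
    volume_mb_s = piops * io_size_kb / 1024.0
    usable_mb_s = min(volume_mb_s, network_mb_s)
    return volume_mb_s, usable_mb_s


# Hypothetical example: 4,000 PIOPS at an assumed 16 KB per I/O,
# on an instance whose network tops out around 50 Mbytes/sec.
volume, usable = usable_throughput_mb_s(4000, 16, 50)
print(f"volume could deliver {volume:.1f} MB/s, instance can use {usable:.1f} MB/s")
```

Whenever the first number is larger than the second, the difference is PIOPS you are paying for but cannot reach from that instance.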

Here is the real downside. Using the cloud you can provision new resources to handle peak load (in this case more MySQL slaves to handle read load) as fast as you can click, or faster using API calls, or even automagically if you have some algo forecasting the need for additional resources. But… the EBS volume is all cold blocks, so these new instances will be up and available in minutes, but the IO performance will be poor until you either pre-warm or the slave gets around to writing/reading all blocks.

So the common solution is to pre-warm the blocks by using dd to read the whole EBS device (warming each block as it goes) into /dev/null.

eg: sudo dd if=/dev/xvdf of=/dev/null bs=1M

Consider how long this will take for any reasonably sized DB (200 Gbytes) using an instance with a 1 Gigabit network.

200 Gbytes read at 50 Mbytes/sec = 200,000 Mbytes / 50 Mbytes/sec = 4,000 secs = 1 hr 6 mins 40 secs, i.e. more than 1 hr.
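The same arithmetic as a small helper, so you can plug in your own volume size and read rate. This is a minimal sketch; the function names are made up for illustration, and it uses 1 Gbyte = 1,000 Mbytes as in the calculation above.

```python
def prewarm_seconds(volume_gb, read_mb_s):
    """Seconds needed to read every block once at a steady read rate."""
    return volume_gb * 1000 / read_mb_s


def as_h_m_s(seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    seconds = int(round(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return hours, minutes, secs


# The example from this post: 200 Gbytes at 50 Mbytes/sec.
total = prewarm_seconds(200, 50)
print(total, as_h_m_s(total))
```

Real read rates on cold blocks will vary, so treat the result as a planning estimate rather than a guarantee.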

So you or your algo provisioned a new EC2 instance for the database in minutes but either your IO will be rubbish for an extended period, or you wait more than 1 hr per 200GB to have the EBS pre-warmed.

What are the solutions?

  1. Forecast further in advance depending on the size of your db (or any other persistent storage layer eg NoSQL etc)
  2. Use ephemeral storage and manage the increased risk of data loss in the event of instance termination.
  3. Break your DB or your application into smaller pieces aka micro services.[2]
  4. Pay more $ and have your databases stay around longer so waiting for an instance to be ready in the beginning is not a problem.

As you can expect, most businesses are happy with option 4. Pay more, leave instances around like they were dedicated servers (base load). Amazon is happy too.

Option 3, whilst requiring some thought (argh) and additional complexity, is where the real speed of provisioning and the, dare I say it, agile nature of the cloud will bear the most fruit.

[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewarm.html

[2] http://martinfowler.com/articles/microservices.html

 


MySQL Error: error reconnecting to master

Error message:

Slave I/O thread: error reconnecting to master
Last_IO_Error: error connecting to master

Diagnosis:

Check that the slave can connect to the master instance, using the following steps:

  1. Use ping to check the master is reachable. eg. ping master.yourdomain.com
  2. Use ping with the IP address to check that DNS isn't broken. eg. ping 192.168.1.2
  3. Use the mysql client to connect from slave to master. eg. mysql -u repluser -pREPLPASS --host=master.yourdomain.com --port=3306 (substitute whatever port you are connecting to the master on)
  4. If all steps work, then check that repluser (the slave's replication user) has the REPLICATION SLAVE privilege. eg. show grants for 'repluser'@'slave.yourdomain.com';

Resolution:

  • If Steps 1 and 2 both fail, you have a network or firewall issue. Check with a network/firewall administrator, or check the logs if you wear those hats.
  • If Step 1 fails but Step 2 works, you have a DNS or name resolution issue. Check that the slave can resolve names and connect using the mysql client or ssh/telnet/remote desktop.
  • If Step 3 fails, check the error reported. It will be either an authentication issue (login failed/denied) or an issue with the TCP port the master is listening on. A good way to verify that the port is open is to use: telnet master.yourdomain.com 3306 (or the port the master is listening on). If that fails, then one or more firewalls in the network are blocking that port.
  • If you get to Step 4 and everything looks fine, and the slave reconnects fine on retrying, then you have probably had a temporary network failure, name resolution failure, firewall failure, or some combination of these.
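The decision logic above can be written down as a small function. This is only a sketch of the checklist: the function name and the wording of the returned strings are made up for illustration, and the inputs are the pass/fail results you observed when running the four diagnosis steps by hand.

```python
def diagnose(ping_name_ok, ping_ip_ok, mysql_ok, grant_ok):
    """Map the outcome of diagnosis steps 1-4 to the likely cause."""
    if not ping_name_ok and not ping_ip_ok:
        # Steps 1 and 2 both fail.
        return "network or firewall issue"
    if not ping_name_ok and ping_ip_ok:
        # Step 1 fails, Step 2 works.
        return "DNS or name resolution issue"
    if not mysql_ok:
        # Step 3 fails: read the client error to tell the two apart.
        return "authentication issue or TCP port blocked"
    if not grant_ok:
        # Implied by Step 4: the account exists but cannot replicate.
        return "replication user is missing the REPLICATION SLAVE privilege"
    return "probably a temporary network, name resolution or firewall failure"


# Example: ping by name fails, ping by IP address works.
print(diagnose(False, True, True, True))
```

The odd case where ping by name works but ping by IP address fails is not covered by the checklist, so the sketch simply falls through to the later steps.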

Continuing Sporadic issues:

Get hold of the network and firewall logs.
If this is not possible, set up a script to periodically ping, connect, mysql connect and log the results over
time to prove to your friendly network admin that there is a problem with the network.
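A minimal sketch of such a script, using only the Python standard library. It covers name resolution and the TCP connect to the master's port; it does not send ICMP pings or attempt a MySQL login, so extend it if you need those. The function names, log file name and default interval are assumptions for illustration.

```python
import socket
import time
from datetime import datetime, timezone


def probe_once(host, port, timeout=5.0):
    """Resolve host, then try a TCP connect to port. Returns a dict of results."""
    result = {
        "time": datetime.now(timezone.utc).isoformat(),
        "host": host,
        "port": port,
        "dns": None,     # resolved IP address, or None if resolution failed
        "tcp": False,    # True if the TCP connect succeeded
        "error": "",
    }
    try:
        result["dns"] = socket.gethostbyname(host)
    except OSError as exc:
        result["error"] = f"dns: {exc}"
        return result
    try:
        with socket.create_connection((result["dns"], port), timeout=timeout):
            result["tcp"] = True
    except OSError as exc:
        result["error"] = f"tcp: {exc}"
    return result


def monitor(host, port, interval=60, logfile="replication_probe.log", rounds=None):
    """Append one line per probe to logfile; runs forever if rounds is None."""
    count = 0
    while rounds is None or count < rounds:
        r = probe_once(host, port)
        with open(logfile, "a") as fh:
            fh.write(f"{r['time']} host={r['host']} dns={r['dns']} "
                     f"port={r['port']} tcp={r['tcp']} error={r['error']}\n")
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval)


# Example (run on the slave): monitor("master.yourdomain.com", 3306)
```

Each log line is timestamped in UTC, which makes it easier to line up failures with the network or firewall logs later.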

How MySQL deals with it:

MySQL will try and reconnect by itself after a network failure or query timeout.

The process is governed by a few variables:

master-connect-retry
slave-net-timeout
master-retry-count

In a nutshell, after getting a timeout (slave-net-timeout) a MySQL slave will try to reconnect, waiting the number of seconds in master-connect-retry between attempts, but only for the number of times specified in master-retry-count.
By default, a MySQL slave waits one hour (slave-net-timeout) before deciding the connection is broken, and will then retry every 60 seconds up to 86,400 times. That is every minute for 60 days.
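To see how long a slave will keep trying with your own settings, multiply the two retry variables. A minimal sketch, with a made-up function name and the default values quoted above:

```python
def retry_window(connect_retry_s=60, retry_count=86400):
    """Return (total_seconds, total_days) the slave keeps retrying for."""
    total_seconds = connect_retry_s * retry_count
    return total_seconds, total_seconds / 86400  # 86,400 seconds in a day


# Defaults from this post: every 60 seconds, 86,400 times.
seconds, days = retry_window()
print(seconds, days)
```

This does not include the initial slave-net-timeout wait, which comes on top of the retry window.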

If the one hour slave-net-timeout is too long for your DR/Slave read strategy you will need to adjust it accordingly.

Edit: 2011/02/02

Thanks to leBolide. He discovered that there is a 32 character limit on the password for replication.

Have Fun

Paul

P.S. If you liked this post you might be good enough to try these challenges

https://dbadojo.com/2016/07/29/mysql-challenges-part-one/