Outage: Feb 14 04:39 to 06:35 ~2 Hours

Unfortunately we had a Valentine’s Day outage of around 2 hours.

Incident Timeline: (times in UTC)

04:39 - Our monitoring sees a 50x error.
04:41 - I am alerted via email & phone.
04:48 - I acknowledge the incident and start investigating.
04:50 - I cannot access the VM via SSH. I issue a reboot via our control panel.
04:54 - Our server has a load of 12, and IOWait is at 57%.
05:30 - I issue another reboot but still can’t figure out what’s wrong.
05:58 - I lodge a ticket with our provider asking them to check the host and to power it off and on again, as we still have huge IOWait values and 100% memory usage.
06:30 - The hosting company hasn’t got back to me, so I roll back my latest configuration changes and reboot.
06:35 - Sites are back online.
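
For anyone wondering where an IOWait percentage like the one in the timeline comes from: on Linux it can be worked out from two samples of /proc/stat. The sketch below is only an illustration, not the tooling used during the incident. The function names are made up, and it assumes the usual field order on the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).

```python
def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    # Sum only user..steal; guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter (assumed field order, see above).
    return 100.0 * deltas[4] / total
```

To use it on a live box, read /proc/stat, wait a few seconds, read it again, and pass both strings to iowait_percent.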

Resolution

The latest change included turning on huge pages with a value of 100MB to give Postgres some performance gains.
This change was made on Monday morning, and I had planned to do a power cycle this week to confirm everything was on the up-and-up. Turns out my host did that for me.
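
One check that could be run after a change like this, and before the next reboot, is to compare how much memory the kernel has set aside for huge pages with the total RAM. Below is a minimal sketch that parses the text of /proc/meminfo. The function names are hypothetical, and the numbers in the usage note are invented for illustration, not taken from this server.

```python
def parse_meminfo(meminfo_text):
    """Parse /proc/meminfo text into {field name: integer value}."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def hugepage_reservation(meminfo_text):
    """Return (reserved kB, percent of MemTotal) held by the huge page pool."""
    info = parse_meminfo(meminfo_text)
    # HugePages_Total is a page count; Hugepagesize and MemTotal are in kB.
    reserved_kb = info["HugePages_Total"] * info["Hugepagesize"]
    return reserved_kb, 100.0 * reserved_kb / info["MemTotal"]
```

With made-up values of MemTotal 2048000 kB, HugePages_Total 50 and Hugepagesize 2048 kB, this reports 102400 kB reserved, or 5% of RAM. A much larger share than intended would be a hint that the setting was interpreted differently than expected.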

The outage lasted longer than it should have due to some $job and $life.

Until next time,
Cheers,
Tiff