Hi Nathan,
Our first priority was to bring the server back, and we were focussed on that. There was no "silence" once I was made aware of the issue; it was communicated in the enterprise forum.
What happened was that enterprise dropped and our techs could not reboot it from the reboot console. So they assigned this to L3, which is Steve and me, and we were supposed to receive an SMS from Alertra, but this SMS did not arrive. I came on shift several hours later and, to my horror, saw that the server was down, so I tried to reboot it myself, but no go there. I posted the announcement and contacted the colo, where we had them manually reboot the server. This still did not bring the server back, so the colo's remote hands hooked up a crash cart to the server and had problems getting it to load up. Each fsck attempt failed at first, but eventually the server was able to fsck properly and load up.
That's what happened. It's not pretty, but the truth is not always pretty. As for what's been done to ensure this does not happen again: we can't prevent a server from dropping or having issues, but we can prevent ourselves from not knowing about it when a monitoring system fails to report it being down. So I've done something I should have done years ago and set up another 3rd party monitoring account for all Dotable servers, instead of just relying on Alertra.