I am in Atlanta, Georgia today, which is normally around a 7-hour flight from Seattle. I won’t go into all the details, but after almost every issue you can face on an airplane, it took me 21 hours to get here. Needless to say, I’m a little sleepy as I type this!
Now I don’t blame the airlines for the issues they faced – a broken airplane and weather that diverted our flight from one place to another – but I do want to mention the way they handled the situation, and how we DBAs can learn from their mistakes. The trouble came down to two major failings that every part of the carrier I flew with repeated over and over: bad processes and no information. We weren’t told what was going on, or what they or we could do to make things better.
So here are the lessons you and I can take away from my nightmare flight:
1. Keep your systems maintained and monitored
If you make sure that your systems are properly maintained, you can avoid many issues to begin with. If the aircraft I was on had been newer or better maintained and monitored, we would have been less likely to hit the first issue that started the downward trend. Make sure you keep up with your systems from both a hardware and software standpoint, and set up good monitoring practices so that you can catch issues before they turn into outages. The best way to deal with a problem is not to have it.
2. Let people know what’s going on, what you’re doing about it, and when you think you will be done
Our big frustration after 2.5 hours on the runway at the wrong airport (that was about the third problem of the day) was that we had no information – not just us, but the flight attendants as well. Eventually, you’ll have an issue on a system that you take care of. Remember that lots of folks use the systems we run, so they need to know what’s going on – it’s a basic human desire.
At Microsoft, we get detailed, analytical reports from IT when we have issues, which I have to say really isn’t that often. Of course, we’re a bunch of tech geeks, so we really want that kind of detail – perhaps you don’t need to tell your users which LUN you need to replace. Something like this is often just as good:
"Hello – by now you probably know the XYZ application is not working. It’s because of a hardware issue on one of the systems that runs the database. We have three people working on it right now, and if we can just replace the part, we’re thinking that it will be around ten minutes to be back in business. If we can’t, it could take longer, but I’ll let you know when I know. For now, you can use the TPS reports as a fallback for the data. Please don’t contact the helpdesk; they already know about the issue."
3. Tell people what their options are
One of the biggest frustrations the passengers had on my flight was a by-product of the "no information" problem. Once we landed, we didn’t know how to rebook the tickets we held for flights that were now long gone. Eventually we stumbled onto what to do, but only after a lot of trouble and time. A simple "go stand in that line" would have been useful. Just as I did in that sample e-mail above, let people know what they can do until the system is up, even if it is just "you’ll have to wait on us. Sorry!" And while we’re on that topic, you do have a fallback as part of your disaster recovery process, correct? If not, it’s time to build one NOW, while the systems are running and healthy.
So with these simple tips, you can avoid the problems the airlines created for themselves.