RE: The value of a little failure here and there

Although Bob seems to stop short of suggesting "intentional" outages, he seems to allude to creating them... but it's a good article nonetheless (minus the criminal behavior).

The value of a little failure here and there (IS Survivor)

I am categorically opposed to causing "intentional" outages to point out "problems" with a system, but I'll relate a humorous vignette nevertheless. (I have known the IT perpetrator personally for many years, so I can attest to its veracity.)

Once upon a time, an IT manager (whom we'll call Wilbur for humor's sake) started a new job at a new company. He discovered that his predecessor had left no documented service level agreements (SLAs). Upon interviewing his supervisors to determine what his SLAs should be for the various functions that his department provided to the business, Wilbur became concerned with the responses from senior management.

Senior management indicated that "email was not mission critical" and that they were willing to tolerate a day or two, maybe three, of downtime if that meant a reduction in operating cost and cost avoidance by not upgrading the Exchange servers. Based on Wilbur's professional experience, senior management was engaged in wishful thinking bordering on self-deception.

Hence Wilbur tested his hypothesis by turning off the email servers later that very week. I'm sure that you can imagine what happened next...

Within a few minutes, polite inquiries from throughout the company were forthcoming. They were politely answered with promises to investigate further and provide an update. Wilbur, of course, didn't need to check anything. He knew why the servers were down.

Within less than an hour, anxiety began to be evident in each communication (by phone or in person, since email was down) from senior management. Wilbur simply pointed out that he was in the process of checking things out, with a gentle reminder that his SLA for email recoverability was a minimum of 24 hours with an outside edge of 72 hours.

Within two hours, senior management was in full-throated panic. Invoices were expected. Customer communication was being delayed. Work throughout the organization had virtually stopped. Because email was down. Wilbur received agreement that perhaps a one-to-three-day SLA for email recoverability was not acceptable after all... and announced that he was pretty close to figuring out what the problem was.

About ten minutes later, Wilbur turned the mail servers back on... and the people rejoiced. They also ponied up the budget required to bring the mail servers up to modern standards and buy the additional equipment needed to support a reasonable SLA: 60-minute recoverability of function and a maximum of one day of data loss.

I would never recommend turning your boss' servers off just to prove a point, because we can usually do a little creative roleplay of a hypothetical outage and achieve the same goal. Feel free to read Wilbur's tale to your boss and ask "What if?"

Comments (2)

  1. Reed Me says:

    Not exactly phoning in a correction to the record, he just wanted to make an observation or two. The

