I’ve been working in IT for 25 years, both inside and outside of Microsoft, at companies of various sizes. While the scale of IT operations changes with each company, there’s a lot that’s common. I was recently reminded of this point while working with our business continuity team. No matter where you work in IT, being prepared for disasters is a common – and critical – business operation.
Today I oversee a worldwide engineering and services management group within Microsoft’s IT department. We’re part of a big IT operation, running systems that are a big target for would-be hackers and that connect 180,000+ users. Customer delight, for both internal and external customers, is an important outcome for us.
Yet I wasn’t surprised when I read a survey that showed many businesses treated big issues such as IT downtime and disaster recovery as commonplace. Whether you’re running IT systems completely in-house, via a service provider, or, like us, through a combination of the two models, uptime and reliability are mandatory. We all know that IT systems power customer connections, internal collaboration, marketing campaigns, financial reporting and more. Business operations rely on IT systems. As one columnist noted,
Most IT organizations are running flat out these days so it’s difficult to put a contingency plan in place when all hands are needed on deck every day. Nevertheless, disasters both natural and otherwise loom day and night, so just remember that old adage about an ounce of IT prevention.
An Ounce of IT Prevention
On March 11, 2011, the Tohoku earthquake in eastern Japan produced an incomprehensible crisis for the people of Japan and organizations with operations and customers in Japan. I won’t try to summarize the extent of the physical and emotional damage in this blog. But this event can serve as a backdrop to share my professional learnings around business continuity. Unfortunately, it’s events like this that make us reflect on our own business.
For me, business continuity comes down to three critical components: processes, planning, and preparation. These three components are the foundation of our continuity plans, and helped us manage disruptions resulting from the Tohoku earthquake.
Processes. The first step is to identify critical business processes, such as revenue, legal and regulatory, workforce, and others. While doing so, also identify the critical applications that support these critical processes and the critical people that make it all happen.
Planning. Create a business continuity plan for each process and application. While doing so, create partnerships with operations groups, such as security, human resources, facilities, legal, PR and others.
Preparation. Schedule simulations on a regular basis. Our simulations include leaders and individual contributors who are given specific disaster scenarios during regular half-day sessions.
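To make the first two components concrete, here is a minimal sketch of how a process inventory might be recorded and checked for missing continuity plans. It is an illustration only, not the tooling we use: the class names, the `find_plan_gaps` helper, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Application:
    """A critical application supporting a business process (hypothetical model)."""
    name: str
    has_continuity_plan: bool = False


@dataclass
class BusinessProcess:
    """A critical business process, its key people, and its applications."""
    name: str
    owners: List[str] = field(default_factory=list)
    applications: List[Application] = field(default_factory=list)
    has_continuity_plan: bool = False


def find_plan_gaps(processes: List[BusinessProcess]) -> List[str]:
    """Return the processes and applications that still lack a continuity plan."""
    gaps = []
    for process in processes:
        if not process.has_continuity_plan:
            gaps.append(process.name)
        for app in process.applications:
            if not app.has_continuity_plan:
                gaps.append(f"{process.name} / {app.name}")
    return gaps


# Sample data is invented for illustration.
inventory = [
    BusinessProcess(
        name="revenue",
        owners=["finance lead"],
        applications=[
            Application("billing", has_continuity_plan=True),
            Application("order entry"),
        ],
        has_continuity_plan=True,
    ),
    BusinessProcess(name="workforce", owners=["HR lead"]),
]

print(find_plan_gaps(inventory))
```

A list like this could feed directly into simulation planning: any entry it reports is a candidate scenario for the next session.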
After the earthquake and tsunami in Japan, we worked closely with other Microsoft operations teams in Redmond, Japan and Asia Pacific to understand the safety of our employees in the region. We were able to account for all employees within 48 hours.
In parallel, we focused on continuing to support our partners and serve our customers in that region. The disaster disrupted more traditional communications links, causing people to rely heavily on online services to keep in touch. In a disaster like this, free services such as email and instant messenger can take on a different priority. This meant working closely with our internal cloud infrastructure service provider team, Global Foundation Services, which manages our datacenter in Japan that serves Microsoft offices and employees, as well as partners and customers. Global Foundation Services maintained a 24-hour-a-day open phone line, with regular updates, to help us understand the local situation and react accordingly.
A specific consideration in this instance was the requirement to protect our staff from potential harm in the wake of the damage to the nuclear power reactors and the resulting radiation leaks. We were able to remotely manage the production environment via IT software tools. We were able to identify all the business processes and applications running in this local datacenter. This process allowed us to make informed decisions about how to keep the business running and help our staff in the region.
We experienced no major service impact during the aftermath of the earthquake and tsunami. Resolving problems with the regional network, undersea cabling, and rolling power outages, and the slow connections they caused, was a high priority. We had to reduce the load on the local grid because its capacity was greatly diminished by the nuclear power plant damage and rolling blackouts. Therefore, our business continuity plans included powering down non-critical infrastructure. We also had to rebuild our Asia network backbone over the weekend by routing away from Japan so we could maintain a fast connection between our Redmond headquarters and Asian offices. It normally takes 3-6 months to get a network link running across the Pacific Ocean. Thanks to our strong partnerships, we already had advance plans in place with carriers in case undersea cabling became compromised.
Our planning also allowed us to offer assistance to external customers and partners. Those impacted by the earthquake were provided with free incident support to help get their operations back up and running. We also used our guest wireless access in and around the local datacenter to offer Internet access to local community service organizations. We shared best practices with customers and partners on how to power down servers during the rolling blackouts.
I’ll leave you with one additional insight, one that isn’t tied to the events in Japan. When we first began to identify the critical business continuity processes across the company, our business partners struggled to assign first, second and third priority to processes. For them, almost all processes were critical. This mindset changed, however, when we put a price on business continuity for each business process. Exposing the price for these processes changed the conversation, and it’s now part of the conversation up front.
Perhaps it’s time for me to change my thinking to the four Ps of business continuity.