Now that we've had time to dig deeper into what caused the temporary problems last weekend, I want to share what we've found so far. We still have more work to do, but we now understand what occurred and why. We've also implemented several changes to prevent this kind of thing from happening again.
What exactly happened?
Two key things happened. First, both activations and validations were affected when preproduction code was accidentally sent to production servers. Second, while the issue affecting activations was fixed in less than thirty minutes (by rolling back the changes), the effect of the preproduction code on our validation service persisted after the rollback took place.
How did this happen in the first place?
Nothing more than human error started it all: pre-production code was sent to production servers. Those production servers had not yet been upgraded with a recent change that enables stronger encryption and decryption of product keys during the activation and validation processes. As a result, the production servers declined activation and validation requests that should have passed.
Why did it take so long to fix?
While the response to the activation issue was quick (less than thirty minutes), the effect on our validation service continued even after the rollback took place. We expected the rollback to fix both issues at the same time, but we now realize we didn't have the right monitoring in place to confirm the fixes had the intended effect.
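The monitoring gap can be illustrated with a small sketch (hypothetical names, not our actual tooling): after a rollback, probe every affected service rather than assuming that fixing one path fixed them all.

```python
def verify_rollback(checks):
    """Smoke-test every affected service after a change is rolled back.

    `checks` maps a service name to a probe callable that returns True
    when the service is behaving correctly again. Returns the names of
    any services that still fail, so a rollback that only partially
    worked is caught immediately instead of days later.
    """
    failures = []
    for name, probe in checks.items():
        try:
            ok = probe()
        except Exception:
            # A probe that errors out counts as a failure too.
            ok = False
        if not ok:
            failures.append(name)
    return failures
```

In last weekend's terms, a check set covering both activation and validation would have reported the validation service as still failing even though the activation probe passed.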
If the servers are down, why don't you just assume the systems are genuine?
We do. It's important to clarify that this event was not an outage. Our system is designed to default to genuine if the service is disrupted or unavailable. In other words, we designed WGA to give the benefit of the doubt to our customers. If our servers are down, your system will pass validation every time. This event was not the same as an outage because in this case the trusted source of validations itself responded incorrectly.
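As a rough sketch of that fail-open design (hypothetical names; a simplification, not our actual service code), the client treats any failure to reach the service as a pass, and only an explicit answer from the trusted source can mark a system non-genuine:

```python
GENUINE = "genuine"
NOT_GENUINE = "not genuine"

def validate(check_with_service, product_key):
    """Fail-open validation: if the trusted service can't be reached,
    give the customer the benefit of the doubt and pass validation.

    `check_with_service` is a callable that queries the real service
    and returns its verdict.
    """
    try:
        return check_with_service(product_key)
    except (ConnectionError, TimeoutError):
        # Outage: servers down or unreachable -> default to genuine.
        return GENUINE
```

Note what this design does and doesn't protect against: an unreachable service always passes, but last weekend the service itself was up and answering incorrectly, so its incorrect answers were trusted.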
What changes have you made?
We have implemented several changes to address the specific issues that took place over the weekend. For example, we are improving our monitoring capabilities to alert us much sooner should anything like this happen again. We're also working through a list of additional changes, such as increasing the speed of escalations and adding checkpoints before changes can be made to production servers.
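One such checkpoint could look like the following sketch (hypothetical capability names; an assumption for illustration, not our actual deployment system): refuse a push when the build requires a capability, such as a newer key-encryption scheme, that the target server doesn't yet support.

```python
def deployment_blockers(build_requirements, server_capabilities):
    """Pre-deployment checkpoint: compare what a build needs against
    what the target server provides.

    Both arguments map a capability name to a version number. Returns
    the names of capabilities the build needs at a higher version than
    the server has; an empty list means the push is safe to proceed.
    """
    return [
        name
        for name, needed in build_requirements.items()
        if server_capabilities.get(name, 0) < needed
    ]
```

For instance, pre-production code needing a hypothetical `key_encryption` capability at version 2 would be blocked from a server still at version 1, which is exactly the kind of mismatch that caused last weekend's declined requests.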
Why were some customers told that this problem might continue for days?
As I mentioned in my post yesterday, we erroneously said the servers might be down until Tuesday, when in fact they had already been fixed as of late Saturday morning Pacific Time. We're reviewing our procedures on that score as well; communicating clearly and accurately is super important when things like this happen.
What were customers experiencing?
For the customers who failed validation from Friday afternoon through Saturday morning, the experience was that features we refer to as ‘genuine-only' features were disabled. These features are Windows Aero, Windows ReadyBoost, Windows Defender (in this state, Defender will still scan for and identify all the threats it ordinarily would, but will only clean those marked ‘severe'), and Windows Update (in this state, only ‘optional' updates are unavailable; all others, including security updates, can still be downloaded). A message also appears in the lower right hand corner of the desktop that reads ‘This copy of Windows is not genuine'; it persists until a successful validation is performed.
The form of validation failure experienced by those affected late Friday and early Saturday DID NOT start the 30-day grace period during which activation is required. Nor was there any 3-day period during which a customer was required to do anything related to this issue. Disabling the genuine-only features is meant to notify the customer of the state of the system; when disabled, the features present their own error messages explaining that the system is not genuine. It's unfortunate this happened to users with genuine systems.
I also want everyone to know that I am personally very disappointed that this event occurred. As an organization we've come a long way since this program began and it's difficult knowing that this event confused, inconvenienced, and upset our customers.
As always, please send your feedback to me through the blog (you can use the email link in the upper left hand corner of the page) or post comments.