What happened when the Test Case Explorer extension hit turbulence

As highlighted on www.visualstudio.com/news, a really exciting Visual Studio Team Services deployment occurred. Unfortunately, it also caused one of our prototype extensions, Test Case Explorer, to break unexpectedly.

Everyone leaned into isolating, understanding, and fixing the issue. No small feat, considering that m&m (Mathias and Mattias) are part-time volunteers living in a different time zone. We are using this outage to find ways to innovate, chisel away constraints, and deliver continuous value, as encouraged by Donovan in his definition of DevOps.

Special THANK YOU to Mattias Sköld, who fixed the issue and provided an overview of the symptom, problem and root cause.

storyline

  • 2016-01-26 19:16 PDT - We started receiving emails from customers and support.
  • 2016-01-27 04:38 PDT - Mattias started initial investigations, confirmed the issue and recommended a resolution.
  • 2016-01-27 15:07 PDT - We started seeing CI builds failing on our dashboard, as Mattias was working on the fix.
  • 2016-01-27 19:11 PDT - Build succeeds. Note that this was 4AM in Mattias’ time zone.
  • 2016-01-28 08:20 PDT - We completed our validations and published the v1.1.4 revision.

symptom

The extension crashed with the error: Cannot use property “Splitter” on null.

problem and root cause

  • The extension referred to CommonControls.Splitter, which had moved to Controls.Splitter between m90 and m94.
  • The extension prototype was developed while the extensibility framework and associated libraries were in flux. While we tried our best to keep track of what was marked as obsolete or raised an obsolete warning, it is possible that we missed a warning, or that the module/namespace was moved post-m90.
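
To illustrate the failure mode, here is a minimal TypeScript sketch of a defensive lookup that tries several candidate modules for a member and fails with a descriptive message instead of a null dereference. It is not the fix that shipped in v1.1.4, and it does not use the real VSTS extension SDK: `ModuleExports`, `resolveExport`, and the stand-in `commonControls` / `controls` objects are all hypothetical names made up for this example.

```typescript
// Hypothetical shape of a loaded module: a bag of exported members.
// A module that failed to load, or no longer exists, is null/undefined.
type ModuleExports = { [name: string]: unknown } | null | undefined;

// Return the first candidate module's member with the given name.
// Throws a descriptive error if none of the candidates exposes it.
function resolveExport<T>(
    candidates: Array<[string, ModuleExports]>,
    memberName: string
): T {
    for (const [, exportsObj] of candidates) {
        if (exportsObj != null && exportsObj[memberName] !== undefined) {
            return exportsObj[memberName] as T;
        }
    }
    const tried = candidates.map(([name]) => name).join(", ");
    throw new Error(`"${memberName}" not found in any of: ${tried}`);
}

// Stand-ins (assumptions): the old location is gone, the new one has the member.
const commonControls: ModuleExports = null;
const controls: ModuleExports = { Splitter: class Splitter {} };

const SplitterCtor = resolveExport<new () => object>(
    [["CommonControls", commonControls], ["Controls", controls]],
    "Splitter"
);
const splitter = new SplitterCtor();
```

A lookup like this only softens the symptom. Tracking obsolete warnings and pinning against the current module layout remains the real safeguard.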

our take-away(s)

We need to explore ways to instrument the client for unhandled exceptions and actively monitor requests that are failing on the VSTS server side. At the same time, we need to continue investing in a continuous pipeline for all our tooling and extension projects, so that we can predict and deal with turbulence proactively … or even better, without you, the user, noticing.
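
As a rough idea of what client instrumentation could look like, the TypeScript sketch below wraps a function so that any exception it throws is handed to a reporter before being rethrown. It is only one possible approach under stated assumptions: `ErrorReport`, `Reporter`, `withErrorReporting`, and `initSplitter` are hypothetical names, and the reporter here just collects reports in memory where a real one would send them to a telemetry service.

```typescript
// Hypothetical report shape; a real telemetry payload would likely carry more.
interface ErrorReport {
    source: string;
    message: string;
    timestamp: string;
}

type Reporter = (report: ErrorReport) => void;

// Wrap fn so that exceptions are reported and then rethrown unchanged.
function withErrorReporting<A extends unknown[], R>(
    source: string,
    fn: (...args: A) => R,
    report: Reporter
): (...args: A) => R {
    return (...args: A): R => {
        try {
            return fn(...args);
        } catch (err) {
            const message = err instanceof Error ? err.message : String(err);
            report({ source, message, timestamp: new Date().toISOString() });
            throw err; // keep the original failure visible to the caller
        }
    };
}

// In-memory reporter (assumption): stands in for a call to a telemetry endpoint.
const reports: ErrorReport[] = [];

const initSplitter = withErrorReporting(
    "initSplitter",
    (controlsModule: { Splitter?: unknown } | null) => {
        if (controlsModule === null) {
            throw new Error("controls module is null");
        }
        return controlsModule.Splitter;
    },
    (r) => { reports.push(r); }
);
```

Rethrowing after reporting keeps the existing behaviour intact, so the instrumentation adds visibility without hiding failures from the user or from other handlers.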

Watch this space!