Understanding WMI Job Expiration in Hyper-V

A long time ago I blogged about how to handle WMI job objects in Hyper-V.  The short story is that if you script or program against Hyper-V, chances are you will need to look at WMI jobs to find out whether requested operations have completed (operations like starting a virtual machine, creating a virtual hard disk or taking a snapshot all return their status through job objects).

When we first started implementing WMI job objects, we built the system to update the status of a WMI job and to delete the job when the operation was completed.  There was an obvious flaw here: it was never possible to tell whether an operation had succeeded or failed, because the moment the operation completed (successfully or not) we would set the status on the WMI job and then immediately delete it.  Whoops.

To deal with this we came up with a simple solution: once an operation is completed we maintain the WMI job in memory for 5 minutes before we delete it.  This gives any programs enough time to check the status and act appropriately.  However, this introduced a new problem.  Each of those WMI job objects took up memory to maintain – and on a large, active system we could waste a lot of memory keeping them around.  So a second change was made: if we find that we have over 4096 WMI job objects in memory – we start expiring the oldest jobs to reduce memory usage.
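
To make those two rules concrete, here is a toy Python model of them.  It is only a sketch of the behavior described above: the class and method names are made up for illustration, it tracks completed jobs only, and it assumes timestamps never go backwards.  It is not how the Hyper-V service is actually implemented.

```python
from collections import OrderedDict

RETENTION_SECONDS = 5 * 60   # completed jobs are kept for 5 minutes
MAX_JOBS = 4096              # above this count, the oldest jobs are expired


class JobTable:
    """Toy model of the two retention rules (hypothetical, for illustration)."""

    def __init__(self, retention=RETENTION_SECONDS, max_jobs=MAX_JOBS):
        self.retention = retention
        self.max_jobs = max_jobs
        # job_id -> (status, completed_at), kept in completion order
        self._jobs = OrderedDict()

    def complete(self, job_id, status, now):
        """Record a finished job, then apply both expiration rules."""
        self._jobs[job_id] = (status, now)
        self._expire(now)

    def lookup(self, job_id, now):
        """Return the job's status, or None if the job has expired."""
        self._expire(now)
        entry = self._jobs.get(job_id)
        return entry[0] if entry else None

    def _expire(self, now):
        # Rule 1: drop jobs that finished `retention` or more seconds ago.
        while self._jobs:
            _status, completed_at = next(iter(self._jobs.values()))
            if now - completed_at < self.retention:
                break
            self._jobs.popitem(last=False)
        # Rule 2: if the table is over the cap, drop the oldest jobs first.
        while len(self._jobs) > self.max_jobs:
            self._jobs.popitem(last=False)
```

The second rule is the one that surprises people: under load a job can disappear long before its 5 minutes are up, simply because enough newer jobs have completed after it.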

For the most part this system works just fine.  However, there are a couple of cases where I hear from people who have accidentally hit WMI job expiration problems.  Two specific examples come to mind:

  • “I got a coffee, and when I came back things were all confused”

    Most of the time I hear this from developers who took a coffee break at just the wrong time while testing their code, and came back to find the code failing to locate a job object that had exceeded its 5 minute timeout.  There are also a few places in our UI (specifically in our wizards) where you can hit this: if you sit on the right page for long enough, we will be very confused when you move to the next page.

  • “I was doing bulk operations, and started getting missing job operations”

    In this case I hear from people who have written a script to perform a bulk operation (like creating 4,000 virtual hard disks as quickly as possible).  What can happen here is that not only will their script or program have to deal with missing job objects, but if they are using the Hyper-V UI to perform other actions at the same time, the UI can get confused because its jobs are disappearing too.

Now, for the most part the Hyper-V UI attempts to handle WMI job expiration as elegantly as possible: it will just report that it cannot find the job associated with the requested operation and direct you to look at the event log for details.  But this is something you should consider if you are writing your own code against our interfaces: jobs are not available forever, and under load they can disappear quite quickly.
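
If you are writing your own polling code, one way to account for this is to treat "job not found" as a distinct outcome rather than as a crash or a failure.  Below is a minimal Python sketch of that idea.  Everything in it is an assumption made for illustration: `get_state` is a hypothetical callable that you would back with your own WMI query, and the state strings are placeholders rather than real WMI job state values.

```python
import time
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    EXPIRED = "expired"      # job object vanished before we saw a final state
    TIMED_OUT = "timed out"  # our own polling deadline passed


def wait_for_job(get_state, poll_seconds=1.0, timeout_seconds=240.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a job until it finishes, disappears, or our deadline passes.

    `get_state` is a hypothetical callable supplied by the caller.  It
    returns "running", "succeeded" or "failed", or None when the job
    object can no longer be found.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state is None:
            # The job expired: the operation may have worked or not.
            return Outcome.EXPIRED
        if state == "succeeded":
            return Outcome.SUCCEEDED
        if state == "failed":
            return Outcome.FAILED
        if clock() >= deadline:
            return Outcome.TIMED_OUT
        sleep(poll_seconds)
```

When this returns `Outcome.EXPIRED` the caller does not know whether the operation worked, so the sensible next step is to check the result some other way (the event log, as the Hyper-V UI suggests) instead of assuming failure.  For bulk operations, polling each job promptly after requesting it keeps the window for expiration as small as possible.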