Leveraging Event Log Messages and Performance Counter Alerts To Automate Hyper-V

A few times I’ve had people ask me how they can automate moving VMs off of a cluster node based on some type of detected failure or performance issue.  In general my answer is to use System Center, specifically Operations Manager.  Many customers I work with take this advice and leverage the management packs and broad insight that Operations Manager provides to create automatic Diagnostics and Recoveries tied to Monitors.  But for other customers this is too heavy-handed, or they don’t have the resources to leverage Operations Manager.  So are they out of luck?  Not really.  There are two commonly overlooked features in Windows: the Task Scheduler and Performance Monitor.

Within the Windows Task Scheduler you can create a task that is automatically executed every time a specific event is logged, and the task can then run a script.  For example, you can create a task tied to the event that is logged when a network adapter connected to a virtual switch has its network cable disconnected; when that event is detected, a script can pause the cluster node and migrate all of the VMs off of it.  Similarly, you can configure the Windows Performance Monitor to trigger a scheduled task when a performance threshold has been exceeded.  What actions these tasks take is limited only by your imagination and how complex you want to make the scripts they execute.

The biggest drawback to this method versus using something like Operations Manager is that magic term ‘centralization’: these tasks have to be configured on each server, and any changes have to be replicated to each server.  Additionally, either the scripts have to be flexible enough to work across varying server configurations (for example one NIC or two, teamed or not) or you have to configure the script/task individually for each server.  However, going back to our original statement, this is really an answer for environments where Operations Manager is too heavy, which usually means a limited number of servers.
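
If you do end up copying a finished task to other nodes, one possible way to script it is with the ScheduledTasks cmdlets that ship with Windows Server 2012.  The sketch below is untested and rests on several assumptions: the task is named “Disconnected NIC”, it already exists on the local node, and the script it runs has been copied to the same path on every node.

```powershell
# Assumption: a task with this name already exists on the local node.
$taskName = 'Disconnected NIC'

# The account the task runs under; the password has to be supplied again on import.
$credential = Get-Credential -Message 'Account the task should run as'

# Export the local task definition as XML.
$taskXml = Export-ScheduledTask -TaskName $taskName

# Register the same definition on every other cluster node.
$otherNodes = Get-ClusterNode | Where-Object { $_.Name -ne $env:COMPUTERNAME }
foreach ($node in $otherNodes) {
    Register-ScheduledTask -TaskName $taskName -Xml $taskXml -CimSession $node.Name `
        -User $credential.UserName `
        -Password $credential.GetNetworkCredential().Password -Force
}
```

This only copies the task definition.  Any per-server differences, such as NIC names, still have to be handled inside the script itself.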

I’m going to provide a few basic examples just to get you thinking and as a proof of concept.  Specifically I am going to focus on the disconnected network adapter and some overall performance metrics (CPU utilization, Disk throughput and Network throughput).  We’ll start with the disconnected network adapter.

Creating a Task Triggered By Events


Creating the task

  1. Start by opening the Task Scheduler and then create a new task
  2. Give the task a name, e.g. Disconnected NIC
  3. Select an account for the task to run under.  It will need to be an Administrator account for this example to work.  I use a domain account that I provision to all my servers for this specific reason, and yes, I change the password often and the account is significantly limited.
  4. Select “Run whether user is logged on or not”
  5. Select “Run with highest privileges”

Defining the triggers

We will define two triggers: the first for event ID 22, which represents the media being disconnected, and the second for event ID 24, which represents the underlying NIC being disabled or otherwise failing.

  1. Select the triggers tab and click New
  2. Select “On an event” from the Begin the task list
  3. From the Log dropdown select System
  4. From the Source dropdown select “Hyper-V-VMSwitch”
  5. In the Event ID field specify 22
  6. Repeat steps 1-5 specifying 24 as the event id for step 5.
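
Before relying on these triggers, it can help to confirm that your hosts actually log these events.  A minimal sketch, assuming the provider name is the one used as the event source in the script later in this post (Microsoft-Windows-Hyper-V-VmSwitch):

```powershell
# List the most recent occurrences of events 22 and 24 from the virtual switch provider.
# Assumption: the provider name matches the source used by the script later in this post.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-Hyper-V-VmSwitch'
    Id           = 22, 24
} -MaxEvents 10 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message
```

If nothing comes back, pull a cable on a test host and run it again to see which event IDs your drivers and switch configuration produce.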

Defining the actions

In this example there is just a single action, which runs a PowerShell script.  You could have multiple actions, for example one that runs a script to live migrate VMs and a second that sends an e-mail or starts up another host, whatever you want.

  1. Select the actions tab and click New
  2. For the action “Start a program” should already be selected
  3. For the Program/script property specify “%SystemRoot%\system32\WindowsPowerShell\v1.0\powershell.exe” to initiate PowerShell
  4. For the Add arguments property specify the path to the script you want executed when this event occurs.  The script could be as simple as “Suspend-ClusterNode -Drain”.  There is a more in-depth example of a script below as well.
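
The same task can probably be created from the command line with schtasks.exe, which is handy when you have to repeat it on several servers.  This is a sketch rather than a tested recipe: the script path and account name are hypothetical placeholders, and the provider name in the event filter is assumed to match the source used in the script later in this post.

```powershell
# Hypothetical values: adjust the script path and account for your environment.
$taskName   = 'Disconnected NIC'
$scriptPath = 'C:\Scripts\DisconnectedNic.ps1'
$runAs      = 'CONTOSO\svc-hvtasks'

# XPath filter for event IDs 22 and 24 from the virtual switch provider (assumed name).
$query = "*[System[Provider[@Name='Microsoft-Windows-Hyper-V-VmSwitch'] and (EventID=22 or EventID=24)]]"

# /SC ONEVENT creates an event trigger; /EC is the log (channel) and /MO the filter.
# /RP * prompts for the account password; /RL HIGHEST runs with highest privileges.
schtasks.exe /Create /TN $taskName /SC ONEVENT /EC System /MO $query `
    /TR "powershell.exe -NoProfile -File $scriptPath" `
    /RU $runAs /RP * /RL HIGHEST
```

A single filter covers both event IDs here, so this produces one trigger instead of the two defined through the UI.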

Creating a Performance Monitor Alert To Trigger a Task

 


Creating a Data Collector Set

  1. Start by opening Performance Monitor (perfmon)
  2. Expand the “Data Collector Sets”
  3. Right click on “User Defined” and select New->Data Collector Set
  4. Provide a name for the collector set
  5. Select “Create manually (Advanced)” and select Next
  6. Select “Performance Counter Alert” and select Finish

Creating a Data Collector

  1. Right click on the newly created Collector Set and select New->Data Collector
  2. Specify a name and select “Performance counter alert”

Defining the counters and thresholds

  1. Click Add
  2. Select the counter or counters you are interested in alerting on.  In this example I am looking at CPU utilization, available memory and Avg. Disk Queue Length.  It’s worth noting that if multiple conditions are true the alert will only specify the first condition; you can get around this if required by defining multiple alerts.
  3. For each counter you specify, provide an alert threshold, e.g. for CPU utilization I specified Idle time below 10%, for Memory below 8096 Available MBs, etc.
  4. Select Next and then Check the “Open properties for this data collector” and Finish

Defining the counter interval and task to run

  1. In the Data Collector Properties on the Alerts tab you can specify the sample interval.  Keep in mind that the more often you query the counter the more CPU is required; 15 seconds seems like a good starting point.
  2. On the Alert Task tab specify the name you will use for the Task Scheduler task (we’ll create the task in a minute)
  3. You can also specify any arguments you want passed to the task.
  4. When done click OK.
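
For repeatable setups, the alert can likely be defined with logman.exe instead of clicking through the UI.  The following is a hedged sketch: the alert and task names are hypothetical, the thresholds simply mirror the examples in the steps above, and it has not been verified that an alert created this way passes the same arguments to the task as one created through the UI.

```powershell
# Hypothetical names: the task name must match the scheduled task you create for the alert.
$alertName = 'HostPressureAlert'
$taskName  = 'PerformanceAlertTask'

# One "counter<limit" or "counter>limit" expression per threshold.
$thresholds = @(
    '\Hyper-V Hypervisor Logical Processor(_Total)\% Idle Time<10',
    '\Memory\Available MBytes<8096'
)

# -th sets the thresholds, -si the sample interval in seconds, -tn the task to run.
logman.exe create alert $alertName -th $thresholds -si 15 -tn $taskName

# Data collector sets are created stopped, so start it explicitly.
logman.exe start $alertName
```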

Creating the scheduled task

  1. Create a task using the same method as the event triggered tasks.  Make sure its name matches the Alert Task name you provided when defining the performance alert.
  2. You can leave the triggers blank, since the performance alert is the trigger.
  3. The action will look very similar to the event action; however, note that for the arguments, in addition to the script, I have specified ‘$(Arg0)’.  This is what passes the arguments from the performance alert to the script.  It is also worth calling out, since it took me a while to figure out, that the single quotes are required.
  4. Another worthwhile tip: on the Settings tab you may want to select “Do not start a new instance” for the “If the task is already running” option.  Otherwise it will start your script every 15 seconds (or whatever interval you specified).
  5. In terms of the script actions, that’s up to you.  For a basic test I just had it output the arguments to a file ($args | Out-File "C:\Scripts\PerformanceAlert.txt" -Append), which will give you some output like: \Hyper-V Hypervisor Logical Processor(_Total)\% Idle Time;< 95.000000;90.611417;9/25/2012 - 4:04:25 PM
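
If you want the script to act on what the alert reported rather than just log it, the argument string can be split into its parts.  This is a minimal sketch that assumes the format is always the semicolon-separated one shown in the sample output above (counter; condition; measured value; timestamp) and that numbers use a period as the decimal separator.  The function name is made up for this example.

```powershell
# Hypothetical helper: splits the alert text into counter, operator, threshold and value.
function ConvertFrom-PerfAlertArgument {
    param(
        [Parameter(Mandatory = $true)]
        [string]$AlertText
    )

    # Assumed layout: counter path; condition; measured value; timestamp.
    $fields = $AlertText -split ';'
    if ($fields.Count -lt 3) {
        throw "Unexpected alert format: $AlertText"
    }

    # The condition looks like "< 95.000000": a one-character operator, then the limit.
    $condition = $fields[1].Trim()
    $invariant = [Globalization.CultureInfo]::InvariantCulture
    $timestamp = $null
    if ($fields.Count -gt 3) { $timestamp = $fields[3].Trim() }

    [pscustomobject]@{
        Counter   = $fields[0].Trim()
        Operator  = $condition.Substring(0, 1)
        Threshold = [double]::Parse($condition.Substring(1).Trim(), $invariant)
        Value     = [double]::Parse($fields[2].Trim(), $invariant)
        Timestamp = $timestamp
    }
}

# Example using the sample line from the test output above.
$sample = '\Hyper-V Hypervisor Logical Processor(_Total)\% Idle Time;< 95.000000;90.611417;9/25/2012 - 4:04:25 PM'
$alert  = ConvertFrom-PerfAlertArgument -AlertText $sample
"{0} was {1} (limit {2} {3})" -f $alert.Counter, $alert.Value, $alert.Operator, $alert.Threshold
```

In the scheduled task script you would pass in the text received through $args instead of the sample string, then branch on $alert.Counter to decide which action to take.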

More Robust Disconnected NIC Script

This is a more robust script for the disconnected host NIC.  It identifies which NIC is disconnected and any VMs that are impacted by that NIC being disconnected.  It then checks all of the other cluster nodes to see if their NIC is also disconnected before migrating only the affected VMs.

# Build up a log of each step; it is written to disk at the end of the script.
$NicDisconnectedLog = [String]::Empty
$NicDisconnectedLog += "NIC Adapter Disconnection Detected - Attempting to Move VMs`n"

# Read the most recent event 24 logged by the virtual switch.
$NicEvent = Get-EventLog -LogName System -Source Microsoft-Windows-Hyper-V-VmSwitch `
    -InstanceId 24 -Newest 1

# Replacement string 3 is used as the interface description of the failed NIC.
$NicDescription = $NicEvent.ReplacementStrings[3]
$NicDisconnectedLog += ("Disconnected NIC Description: " + $NicDescription + "`n")

# Find the external virtual switch that is bound to that NIC.
$Switch = Get-VMSwitch -SwitchType External |
    Where-Object {$_.NetAdapterInterfaceDescription -eq $NicDescription}

$NicDisconnectedLog += ("Associated Switch Name: " + $Switch.Name + "`n")

# A node is a usable destination if it is up and the NIC behind its
# switch of the same name is also up.
$NicDisconnectedLog += ("Determining Available Cluster Nodes`n")
$AvailableClusterNodes = @()
foreach ($clusterNode in (Get-ClusterNode | Where-Object {$_.State -eq "Up"}))
{
    $destSwitch = Get-VMSwitch -Name $Switch.Name -ComputerName $clusterNode.Name
    if ((Get-NetAdapter -InterfaceDescription $destSwitch.NetAdapterInterfaceDescription `
        -CimSession $clusterNode.Name).Status -eq "Up")
    {
        $AvailableClusterNodes += $clusterNode
        $NicDisconnectedLog += ("Node: " + $clusterNode.Name + " is available.`n")
    }
    else
    {
        $NicDisconnectedLog += ("Node: " + $clusterNode.Name +
            " also has a disconnected switch.`n")
    }
}

if ($AvailableClusterNodes.Count -eq 0)
{
    $NicDisconnectedLog += ("No Available Cluster Nodes - Exiting`n")
}
else
{
    # Collect the unique IDs of VMs with at least one adapter on the affected switch.
    $NicDisconnectedLog += ("Determining Affected VMs`n")
    $AffectedNics = Get-VMNetworkAdapter -VMName * |
        Where-Object {$_.SwitchName -eq $Switch.Name}
    $VMsToMove = @()
    foreach ($Nic in $AffectedNics)
    {
        if ($VMsToMove -notcontains $Nic.VMId)
        {
            $NicDisconnectedLog += ("VM: " + $Nic.VMName + " ID:(" + $Nic.VMId + ")" +
                " is affected - preparing to move.`n")
            $VMsToMove += $Nic.VMId
        }
    }

    # Live migrate each VM, rotating through the available nodes and retrying on
    # the next node if the VM did not end up on the chosen destination.
    $NicDisconnectedLog += ("Preparing to Move Affected VMs`n")
    for ($MoveCounter = $VMsToMove.Count; $MoveCounter -gt 0; $MoveCounter--)
    {
        $attemptCount = 0

        do {
            $nodeIndex = ($MoveCounter + $attemptCount) % $AvailableClusterNodes.Count
            $destinationNode = $AvailableClusterNodes[$nodeIndex]
            $NicDisconnectedLog += ("Moving VM with ID: " + $VMsToMove[$MoveCounter - 1] +
                " to node: " + $destinationNode.Name + "`n")

            $result = Move-ClusterVirtualMachineRole -VMId $VMsToMove[$MoveCounter - 1] `
                -Node $destinationNode.Name -MigrationType Live

            $attemptCount++
        }
        while (($result.OwnerNode.Name -ne $destinationNode.Name) `
            -and ($attemptCount -lt $AvailableClusterNodes.Count))
    }
}

# Write the log (overwrites any previous log file).
$NicDisconnectedLog | Out-File "C:\Scripts\NicLog.txt"

 

 

-taylorb