Keeping diagnostics in sync across Windows Azure instances

Windows Azure Diagnostics provides a great way for operations staff, developers and testers to find out what’s going on within a Windows Azure PaaS (Cloud Service) deployment. In a nutshell, it lets you specify what types of logs you are interested in (application logs, event logs, performance counters, or log files), and at specified intervals these logs are transferred from your Windows Azure VMs to blobs and tables in your Windows Azure Storage Account.

To specify what diagnostics you’re interested in, you normally include a file called diagnostics.wadcfg within each role of your Cloud Service (you can also do it programmatically). This file specifies your “factory default” diagnostics settings and gets baked into your deployment package, so you can’t modify it without redeploying. However, as soon as your role instances start, the diagnostics settings are copied into a blob in the wad-control-container, which is what Windows Azure Diagnostics uses at runtime. There’s a separate blob for each instance of each role of each deployment, and these blobs can be changed after deployment. In fact, if you have the new Windows Azure SDK 2.0 for .NET or a third-party tool like Cerebrata’s Azure Diagnostics Manager, it’s very easy to update the live diagnostics from a GUI, so you don’t need to hand-edit the control container blobs.
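For readers who haven’t seen the programmatic route mentioned above, here is a minimal sketch of what it can look like in a role’s OnStart. The class name, the chosen counter and the transfer intervals are all made up for illustration; only the DiagnosticMonitor calls come from the SDK. Treat it as a starting point under those assumptions rather than a recommended configuration.

```csharp
using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

// Hypothetical role class, shown only to illustrate the programmatic alternative
// to diagnostics.wadcfg. Values below are illustrative assumptions.
public class ProgrammaticDiagnosticsRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Start from the SDK's default settings and adjust what we care about.
        var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Transfer application (trace) logs at Warning level and above every minute.
        config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Warning;
        config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        // Sample one performance counter every 5 seconds and transfer it every minute.
        config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
        {
            CounterSpecifier = @"\Processor(_Total)\% Processor Time",
            SampleRate = TimeSpan.FromSeconds(5)
        });
        config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        // The argument is the name of the configuration setting that holds the
        // storage connection string, not the connection string itself.
        DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

        return base.OnStart();
    }
}
```

Either way, what you define here is still only the starting configuration for an instance; the live settings are the ones held in the control container blob.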

Even though it’s possible to set a different diagnostics configuration for each instance of a role, in practice it’s pretty unlikely that you’ll want this. The people who built the aforementioned tools must agree, because they also provide easy ways to update the diagnostics configuration of all instances of a role at the same time. However, consider what happens in the following situation:

  1. Two instances of WebRole1 are deployed, and Windows Azure uses the “factory default” diagnostics configuration loaded from diagnostics.wadcfg.
  2. To improve operations processes, the diagnostics configuration for both instances of WebRole1 is modified to a “new improved” diagnostics configuration using a tool.
  3. Based on high user demand, the deployment is scaled (either automatically or manually) by adding a 3rd instance of WebRole1.

So what diagnostics configuration would you expect the new 3rd instance to have? Well, the result which you’d almost certainly want is for the 3rd instance’s diagnostics configuration to match the other two. However, when the new instance is deployed it will come with the “factory default” settings, and won’t get “new improved” settings unless someone explicitly applies them.
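To make step 2 concrete, this is roughly what a tool does when it updates every instance of a role at once. It’s a sketch only: the LiveDiagnosticsUpdater class and its method are hypothetical names of mine, and the specific log level and transfer period are arbitrary examples of a “new improved” setting.

```csharp
using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.Diagnostics.Management;

// Hypothetical helper; could be called from a console app or an ops script.
public static class LiveDiagnosticsUpdater
{
    public static void SetLogTransferForRole(string storageConnectionString, string deploymentId, string roleName)
    {
        var deploymentManager = new DeploymentDiagnosticManager(storageConnectionString, deploymentId);

        // One manager per instance that currently has a control blob for this role.
        foreach (var instanceManager in deploymentManager.GetRoleInstanceDiagnosticManagersForRole(roleName))
        {
            var config = instanceManager.GetCurrentConfiguration();
            if (config == null)
            {
                continue; // Nothing to update yet for this instance.
            }

            // Example "new improved" settings (illustrative values only).
            config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;
            config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

            // Writes the change back to this instance's blob in wad-control-container.
            instanceManager.SetCurrentConfiguration(config);
        }
    }
}
```

Notice that the loop can only reach instances that exist at the moment it runs, which is exactly why an instance added later is left on the defaults.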

If this is a concern to you, the good news is that it’s very easy to change the behaviour with a bit of code, and the better news is that I’ve already written the code for you. Hopefully the code is pretty self-explanatory, but basically whenever a role instance starts, if that instance isn’t the first one (instance 0), the diagnostics configuration is copied from instance 0’s current version. You may wonder what happens if instance 0 is ever reimaged or fails over. It’s actually OK, as its wad-control-container blob remains intact even when the new VM starts.

My code is below; if you need to do this across multiple roles or applications, of course you should put this into a reusable library.

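using System;
using System.Linq;
using Microsoft.WindowsAzure;                        // CloudConfigurationManager
using Microsoft.WindowsAzure.Diagnostics;            // DiagnosticMonitorTraceListener
using Microsoft.WindowsAzure.Diagnostics.Management; // DeploymentDiagnosticManager
using Microsoft.WindowsAzure.ServiceRuntime;         // RoleEntryPoint, RoleEnvironment

public class WebRole : RoleEntryPoint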
{
    public override bool OnStart()
    {
        // Bring this instance's live diagnostics settings in line with the
        // first instance's before the role starts doing any work.

        CopyDiagnosticsSettingsFromExistingInstance();

        return base.OnStart();
    }

    private static void CopyDiagnosticsSettingsFromExistingInstance()
    {
        var trace = new DiagnosticMonitorTraceListener();
        try
        {
            var sourceInstance = RoleEnvironment.CurrentRoleInstance.Role.Instances.OrderBy(i => i.Id).First();
            if (RoleEnvironment.CurrentRoleInstance.Id != sourceInstance.Id)
            {
                trace.WriteLine(String.Format("Copying live diagnostics settings from instance {0}.", sourceInstance.Id));
                var diagnosticsConnectionString = CloudConfigurationManager.GetSetting("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString");
                var deploymentId = RoleEnvironment.DeploymentId;
                var deploymentDiagnosticsManager = new DeploymentDiagnosticManager(diagnosticsConnectionString, deploymentId);

                var sourceDiagnosticsManager = deploymentDiagnosticsManager.GetRoleInstanceDiagnosticManager(sourceInstance.Role.Name, sourceInstance.Id);
                var sourceConfig = sourceDiagnosticsManager.GetCurrentConfiguration();

                if (sourceConfig != null) // Can be null during initial deployment, when all instances are coming online
                {
                    var thisDiagnosticsManager = deploymentDiagnosticsManager.GetRoleInstanceDiagnosticManager(RoleEnvironment.CurrentRoleInstance.Role.Name, RoleEnvironment.CurrentRoleInstance.Id);
                    thisDiagnosticsManager.SetCurrentConfiguration(sourceConfig);
                }
            }
        }
        catch (Exception ex)
        {
            trace.WriteLine(String.Format("Error while copying diagnostics settings: {0}", ex.ToString()));
        }
    }
}

It’s critical that developers “design for operations” to ensure applications they develop can be kept running smoothly over time. Diagnostics and scalability are two aspects of designing for operations, and while Windows Azure provides great platform services for both, with this code in place these two concerns will work even better together. Please let me know if you find this useful or have any suggestions or questions.