A Quick Reference to Azure Diagnostics

If you’ve used Azure much, you may have eventually decided to use the DiagnosticsAgent plugin, as I did. However, you may also be dissatisfied (personally I am) with the amount of detail that’s out there about how it all works, what it normally does, etc. So I’m going to take a point-in-time snapshot of the diagnostics agent files on my VM, poke through them, and record what I find interesting and important, so that I or anyone else can look it up later as a quick reference (as with most of my blog posts). [Since I work at MS I should also mention this post is a personal project, not official documentation, and it may even be completely wrong. End disclaimers.]

So... what actually happens when I add the diagnostics plugin to my worker role? Well, it’s going to depend a bit on what configuration I do:

1) I need to add some configuration for it in cscfg. Actually, the only bit of configuration which goes in cscfg is a ‘diagnostics storage account’.

2) I can add some more configuration for it in a diagnostics.wadcfg file in my cloud service project’s role. That might look something like this:

<?xml version="1.0" encoding="utf-8"?>
<DiagnosticMonitorConfiguration configurationChangePollInterval="PT1M" overallQuotaInMB="4096" xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <DiagnosticInfrastructureLogs />
  <Directories>
    <IISLogs container="wad-iis-logfiles" />
    <CrashDumps container="wad-crash-dumps" />
  </Directories>

  <!-- Note that the fastest scheduled transfer can go is 1 minute -->
  <Logs bufferQuotaInMB="1024" scheduledTransferPeriod="PT1M" scheduledTransferLogLevelFilter="Verbose" />
  <WindowsEventLog bufferQuotaInMB="1024" scheduledTransferPeriod="PT1M" scheduledTransferLogLevelFilter="Error">
    <DataSource name="Application!*" />
  </WindowsEventLog>
  <PerformanceCounters bufferQuotaInMB="512" scheduledTransferPeriod="PT5M">
    <PerformanceCounterConfiguration counterSpecifier="\Memory\Available Bytes" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\Memory\Committed Bytes" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\Memory\Page Faults/sec" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% User Time" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\.NET CLR Memory(_Global_)\# Bytes in all Heaps" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\Process(WaWorkerHost)\% Processor Time" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\Process(WaWorkerHost)\Private Bytes" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\Process(WaWorkerHost)\Thread Count" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\IPv4\Datagrams Forwarded/sec" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\IPv4\Datagrams Received/sec" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\IPv4\Datagrams Sent/sec" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\IPv6\Datagrams Forwarded/sec" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\IPv6\Datagrams Received/sec" sampleRate="PT1M" />
    <PerformanceCounterConfiguration counterSpecifier="\IPv6\Datagrams Sent/sec" sampleRate="PT1M" />
  </PerformanceCounters>
</DiagnosticMonitorConfiguration>

If you are learning about diagnostics, you can pick up a bunch of things from this file:

  • you can set a ‘configuration change poll interval’. More on this later.
  • there’s an overall quota for how much storage to use, plus a buffer quota per data source
  • logs generated by something are transferred on a ‘scheduled transfer period’
  • Windows event logs from configurable sources can be collected, and are transferred, again, on a ‘scheduled transfer period’
  • you can pick performance counters off the machine, by name, that you want automatically logged
  • it uses the XML duration format (ISO 8601 style) everywhere, so PT1M and PT5M mean ‘one minute’ and ‘five minutes’
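
To make the duration format in the last bullet concrete, here is a minimal Python sketch that turns values like PT1M into a timedelta. It is my own illustration, not anything the diagnostics agent ships: the name parse_duration is hypothetical, and it deliberately handles only the day/time subset of the format (no years or months, since those have no fixed length).

```python
import re
from datetime import timedelta

# Day/time subset of the XML duration format, e.g. PT1M, PT5M, PT1H30M, P1D.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_duration(text: str) -> timedelta:
    """Convert a duration such as 'PT5M' into a timedelta (hypothetical helper)."""
    match = _DURATION.match(text)
    # Reject non-matches and degenerate forms like 'P', 'PT' or 'P1DT'.
    if not match or text.endswith(("P", "T")):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = match.groupdict(default="0")
    return timedelta(
        days=int(parts["days"]),
        hours=int(parts["hours"]),
        minutes=int(parts["minutes"]),
        seconds=float(parts["seconds"]),
    )


if __name__ == "__main__":
    for value in ("PT1M", "PT5M"):
        print(value, "->", parse_duration(value))
```

Note that the M after the T means minutes; a bare P1M would be one month, which this sketch rejects.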

And that noticing raises an important question: what are the <Logs> and who generates them?

Well, it turns out it’s really easy to generate logs. You write them from your code with System.Diagnostics.Trace, and a System.Diagnostics.TraceListener collects them.

For that to work, you need either some code or some configuration in your role.dll.config or app.config that creates the Azure Diagnostics TraceListener and adds it to the trace listener collection.

<system.diagnostics>
  <trace autoflush="true">
    <listeners>
      <add type="Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitorTraceListener, Microsoft.WindowsAzure.Diagnostics, Version=2.2.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" name="AzureDiagnostics">
        <filter type="" />
      </add>
    </listeners>
  </trace>
</system.diagnostics>

(Note that it’s using the Version 2.2.0.0 assembly because that’s the version of the Azure SDK we’re using to build and deploy the cloud service project, and so we can expect that is the right assembly version for what’s installed on the machine.)

OK, so that’s Logs.
So Azure Diagnostics is going to collect all your logs and performance counters and… do what with them?
You knew the answer to that, right? It’s going to (eventually) shove them into Azure Tables.

The table names are governed by convention, and I don’t know a way to reconfigure them. Which annoys me. But anyway, let’s accept that as a given for now. They are:

-WADLogsTable
-WADPerformanceCountersTable

You don’t need to do anything to create these tables; they will be created automatically in your storage account once your diagnostics are running.

See, I’m jumping ahead. Because we’re now done configuring, you want to deploy your role and run it, right? So what actually happens then?

1) Azure runtime starts up the diagnostics agent plugin. Result of this is that DiagnosticsAgent.exe is launched, and it receives the diagnostics storage account info and .wadcfg that you configured as its initial configuration.
2) DiagnosticsAgent.exe sets up ‘configuration polling’ which monitors something for configuration changes. More about something later.
3) DiagnosticsAgent.exe launches the MonAgentHost.exe which I believe is the actual monitoring agent that polls your performance counters and logs, and transfers data to the table storage. MonAgentHost appears to be unmanaged code.

Now this configuration polling thing is interesting. It’s possible to change the diagnostics configuration after your role is running. The way you do this is through the diagnostics configuration API, ‘Microsoft.WindowsAzure.Diagnostics.Management’.

In order to make changes using this API the basic flow you follow is:
1) get a RoleInstanceDiagnosticManager object that points at your particular role in that particular deployment
2) call its SetCurrentConfiguration() method

What this actually does is update (or create) a specially named configuration blob in your diagnostics storage account, in the ‘wad-control-container’. It turns out that DiagnosticsAgent.exe has been pinging storage to try to read this configuration blob every minute, or however often you set its configuration polling for. Once there’s new config there, the diagnostics system tries to update itself with it.
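
The general shape of that polling can be sketched in a few lines of Python. To be clear, this is not how DiagnosticsAgent.exe is implemented (I haven’t seen its source); it only illustrates the ‘read the blob, notice it changed, reconfigure’ idea. ConfigPoller, fetch and apply are all hypothetical names, and a real implementation might compare blob ETags rather than hashing the content.

```python
import hashlib
from typing import Callable, Optional


class ConfigPoller:
    """Detects configuration changes by comparing content hashes between polls."""

    def __init__(
        self,
        fetch: Callable[[], Optional[bytes]],
        apply: Callable[[bytes], None],
    ) -> None:
        self._fetch = fetch  # hypothetical: returns the blob body, or None if absent
        self._apply = apply  # hypothetical: reconfigures the monitoring agent
        self._last_digest: Optional[str] = None

    def poll_once(self) -> bool:
        """Fetch once; apply the config and return True only if it changed."""
        body = self._fetch()
        if body is None:
            return False  # no configuration blob has been written yet
        digest = hashlib.sha256(body).hexdigest()
        if digest == self._last_digest:
            return False  # same content as last time, nothing to do
        self._last_digest = digest
        self._apply(body)
        return True
```

A caller would invoke poll_once() on a timer, once per configuration change poll interval (PT1M in the .wadcfg above).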

The slightly cool thing about this system is that you can use it to update the diagnostics configuration of any role at any time from anywhere. As long as you have permissions to update the blob in that container, of course. Your role can use this API to update its own config too, if that’s what you want, or perhaps you greatly prefer code-based configuration to .wadcfg XML.

OK.

So now my diagnostics are happily running and getting reconfigured and doing exactly what I want, right? Wait… what is it doing?

Well,
1) Your app edm, and turns them into ETW events in the Windows event log
2) It collects

They are just rows in Azure Tables?
Yes, well I mentioned that before, when I wrote a quick tool for retrieving Azure Diagnostics logs.

But of course if you want to do anything useful with large amounts of log data you’re going to want to do some smarter analysis of the data… and so you’re going to want to know the schema of the table. Which I can’t find officially documented. Grr.

So: here is the schema, mainly derived from the internet. References:

https://stackoverflow.com/questions/5737343/how-to-filter-azure-logs-or-wcf-data-services-filters-for-dummies
https://blog.yaplex.com/azure/windows-azure-logging/
https://gauravmantri.com/2012/02/17/effective-way-of-fetching-diagnostics-data-from-windows-azure-diagnostics-table-hint-use-partitionkey/
https://alexandrebrisebois.wordpress.com/2013/10/13/i-take-it-back-use-windows-azure-diagnostics/

 

The things that obviously matter for querying table storage are PartitionKey and RowKey, since they are what table storage can query efficiently. Timestamp/EventTickCount also matters to some degree, but you don’t want to be querying by Timestamp if you can query by PartitionKey+RowKey instead.

Consensus is that the way PartitionKey is generated is from DateTime.Ticks, plus a bunch of zeros:

0635313933600000000

You’ll note that the values here tend to deviate slightly from the one in Timestamp.
I’m of the belief that the difference is because EventTickCount is the timestamp of the actual log entry when it’s created on the machine. PartitionKey, on the other hand, is either a rounded timestamp or a timestamp of when it was uploaded by the monitoring agent.

Off the top of my head, I believe RowKey is just an arbitrary index value to ensure all entries in the partition have a unique key.
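
If the ‘rounded timestamp’ theory is right, the PartitionKey can be reproduced from a UTC time. The Python sketch below assumes the key is .NET ticks (100 ns units since 0001-01-01) rounded down to the minute, with a leading zero. That assumption is consistent with the sample value above but is not documented anywhere I can find, and to_ticks and partition_key are names I made up.

```python
from datetime import datetime

TICKS_PER_SECOND = 10_000_000           # one .NET tick is 100 nanoseconds
TICKS_PER_MINUTE = 60 * TICKS_PER_SECOND
_TICK_EPOCH = datetime(1, 1, 1)         # DateTime.Ticks counts from this instant


def to_ticks(moment: datetime) -> int:
    """Convert a naive UTC datetime to .NET-style ticks using integer math."""
    delta = moment - _TICK_EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


def partition_key(moment: datetime) -> str:
    """Assumed format: '0' followed by ticks rounded down to the minute."""
    ticks = to_ticks(moment)
    ticks -= ticks % TICKS_PER_MINUTE
    return f"0{ticks}"


if __name__ == "__main__":
    print(partition_key(datetime(2014, 3, 26, 1, 16, 42)))
```

Under these assumptions, any time within 01:16 UTC on 2014-03-26 maps to the sample key 0635313933600000000.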

Now remember, PartitionKeys and RowKeys are strings, and strings compare character by character. So you don’t need to query for PartitionKey > 0635313933600000000; you can just search for > 06353139336.
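
Here is a small Python sketch of why the shortened prefix works, plus a helper that builds a range filter. The ge/lt operators are the standard Table service comparison operators; the helper names in_range and odata_range_filter are hypothetical, and the sketch only builds the filter string rather than calling any storage SDK.

```python
def in_range(partition_key: str, lower: str, upper: str) -> bool:
    """Mimic the string comparison that table storage applies to PartitionKey."""
    return lower <= partition_key < upper


def odata_range_filter(lower: str, upper: str) -> str:
    """Build a filter selecting lower <= PartitionKey < upper."""
    return f"PartitionKey ge '{lower}' and PartitionKey lt '{upper}'"


if __name__ == "__main__":
    # A prefix sorts before every full key that starts with it.
    print("06353139336" < "0635313933600000000")
    print(odata_range_filter("06353139336", "06353139337"))
```

Because the keys are fixed-width digit strings, bounding by a short prefix on both ends selects a contiguous block of minutes without spelling out all nineteen characters.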

Aside from PartitionKey and RowKey, the properties on each WADLogsTable entry are:

public long EventTickCount { get; set; }
public string DeploymentId { get; set; }
public string Role { get; set; }
public string RoleInstance { get; set; }
public int EventId { get; set; }
public int Level { get; set; }
public int Pid { get; set; }
public int Tid { get; set; }
public string Message { get; set; }
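
If you are analysing these rows outside of .NET, the same schema can be mirrored elsewhere. The Python sketch below copies the property names listed above and adds one convenience I made up, event_time, which assumes EventTickCount holds .NET ticks in UTC (that matches the name and the values I see, but it is an assumption, not documented behaviour).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WadLogEntry:
    """One WADLogsTable row; field names copied from the schema above."""

    PartitionKey: str
    RowKey: str
    EventTickCount: int
    DeploymentId: str
    Role: str
    RoleInstance: str
    EventId: int
    Level: int
    Pid: int
    Tid: int
    Message: str

    @property
    def event_time(self) -> datetime:
        """EventTickCount as a naive UTC datetime (assumes .NET ticks)."""
        # Ticks are 100 ns units; dividing by 10 gives whole microseconds.
        return datetime(1, 1, 1) + timedelta(microseconds=self.EventTickCount // 10)
```

Sub-microsecond precision is dropped in the conversion, which should not matter for log analysis.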