Troubleshooting Scenario 7 – Role Recycling

In Troubleshooting Scenario 1 we looked at a role that recycled after deployment, where the root cause was easy to find in the Windows Azure event logs. This post shows another example of the same type of behavior, but with a different, and more difficult to find, root cause. This is a continuation of the troubleshooting series.

Symptom

You have deployed your Azure hosted service and it shows as Recycling in the portal, but there is no additional information such as an exception type or error message. The role status in the portal may cycle through a few different messages such as, but not limited to:

  • Recycling (Waiting for role to start... System startup tasks are running.)
  • Recycling (Waiting for role to start... Sites are being deployed.)
  • Recycling (Role has encountered an error and has stopped. Sites were deployed.)

Get the Big Picture

Similar to the previous troubleshooting scenarios, we want to get a quick idea of where we are failing. Watching Task Manager we see that WaIISHost.exe starts, runs for a few seconds, and then disappears along with WaHostBootstrapper.exe.

[screenshot: Task Manager showing WaIISHost.exe and WaHostBootstrapper.exe]

From the ‘Get the Big Picture’ section in Troubleshooting Scenario 1 we know that if we see WaIISHost (or WaWorkerHost) start and then disappear, the problem is most likely a bug in our code that is throwing an exception, and that the Windows Azure and Application event logs are a good place to start.

Check the logs

Looking at the Windows Azure event logs we don’t see any errors. The logs show that the guest agent finishes initializing (event ID 20001), starts a startup task (10001), successfully finishes the startup task (10002), IISConfigurator sets up IIS (10003 and 10004), and then the guest agent initializes itself again and repeats the loop. There are no obvious errors or anything else to indicate a problem, other than the fact that we keep repeating this cycle a couple of times per minute.

[screenshot: Windows Azure event log showing the repeating startup sequence]
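
If you would rather dump these entries from a small console app running inside the VM than scroll through Event Viewer, a minimal sketch along these lines works (it assumes the guest agent writes to an event log named “Windows Azure”, and that you run it directly on the role instance):

using System;
using System.Diagnostics;
using System.Linq;

class DumpAzureEventLog
{
    static void Main()
    {
        // The guest agent, startup task, and IISConfigurator events land in the "Windows Azure" log.
        var log = new EventLog("Windows Azure");

        // Print the most recent 50 entries so the repeating 20001 -> 10001/10002 -> 10003/10004 cycle is easy to see.
        var recent = log.Entries.Cast<EventLogEntry>().Reverse().Take(50).Reverse();
        foreach (EventLogEntry entry in recent)
        {
            Console.WriteLine("{0}  {1,-6}  {2}", entry.TimeGenerated, entry.InstanceId, entry.Message);
        }
    }
}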

Next we will check the Application event logs to see if there is anything interesting there.

The Application event logs are even less interesting. There is virtually nothing in them, and certainly nothing that would correlate to an application failing every 30 seconds.

[screenshot: Application event log]

As we have done in the previous troubleshooting scenarios we can check some of the other commonly used logs to see if anything interesting shows up.

WaHostBootstrapper logs
If we check the C:\Resources folder we will see several WaHostBootstrapper.log.old.{index} files. WaHostBootstrapper.exe creates a new log file (and archives the previous one) every time it starts up, so based on what we were seeing in Task Manager and the Windows Azure event logs it makes sense to see lots of these host bootstrapper log files. When looking at the host bootstrapper log for a recycling role we want to look at one of the archived files rather than the current WaHostBootstrapper.log file. The reason is that the current file is still being written, so depending on when you open it the role could be at any point in the startup process (e.g. starting a startup task) and the file most likely won’t have any information about the crash or error that ultimately causes the processes to shut down. You can typically pick any of the .log.old files, but I usually start with the most recent one.
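
To find the most recent archived log quickly, a small sketch like this lists the archives newest first (the C:\Resources path and the file name pattern are as described above):

using System;
using System.IO;
using System.Linq;

class ListBootstrapperLogs
{
    static void Main()
    {
        // Archived host bootstrapper logs sit next to the current WaHostBootstrapper.log.
        var archives = new DirectoryInfo(@"C:\Resources")
            .GetFiles("WaHostBootstrapper.log.old.*")
            .OrderByDescending(f => f.LastWriteTime);

        // Newest archive first - usually the one worth opening.
        foreach (var file in archives)
        {
            Console.WriteLine("{0}  {1}", file.LastWriteTime, file.Name);
        }
    }
}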

The host bootstrapper log starts off normally and we can see all of the startup tasks executing and returning with a 0 (success) return code. The log file ends like this:

[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Client reported status 1.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86EF0) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (3752).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (2596).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86E00) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:31.300, INFO ] Client reported status 2.

No error messages or failures (remember from scenario 2 that we can ignore the ‘Failed to connect to client’ and 0x800706ba errors), just a status value of 2 from WaIISHost.exe. The status is defined as an enum with the following values:

0 = Healthy
1 = Unhealthy
2 = Busy

We would typically expect to see a 1 (Unhealthy while the role is starting up), then a 2 (Busy while the role is running startup code), and then a 0 once the role is running in the Run() method. So this host bootstrapper log file is basically just telling us that the role is in the Busy state while starting up and then the process disappears, which is pretty much what we already knew.
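
Expressed as code, the mapping looks something like the enum below (the names are illustrative, based on the values listed above, and are not the actual ServiceRuntime type):

// Illustrative mapping of the status values reported in WaHostBootstrapper.log.
// The names are for readability only; this is not the real internal enum.
enum RoleClientStatus
{
    Healthy = 0,    // role has reached Run() and is running normally
    Unhealthy = 1,  // role is starting up
    Busy = 2        // role is executing startup code (startup tasks, OnStart)
}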

WindowsAzureGuestAgent logs

Once WaIISHost.exe starts up, the guest agent is pretty much out of the picture, so we won’t expect to find anything in these logs; but since we haven’t found anything else useful it is good to take a quick look to see if anything stands out. When looking at multiple log files, especially for role recycling scenarios, I typically find one point in time when I know the problem happened and use that consistent time period to look across all of the logs. This helps prevent aimlessly looking through huge log files hoping that something jumps out. In this case I will use the timestamp 2013/10/02, 22:09:31.300 since that is the last entry in the host bootstrapper log file.

AppAgentRuntime.log

[00002608:00003620, 2013/10/02, 22:09:21.789, INFO ] Role process with id 2916 is successfully resumed
[00002608:00003620, 2013/10/02, 22:09:21.789, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateSuspended to RoleStateBusy.
[00002608:00001840, 2013/10/02, 22:09:29.566, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateUnhealthy.
[00002608:00003244, 2013/10/02, 22:09:31.300, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateUnhealthy to RoleStateBusy.
[00002608:00003620, 2013/10/02, 22:09:31.535, FATAL] Role process exited with exit code of 0
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateStopping.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] Waiting for ping from LB.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] TIMED OUT waiting for LB ping. Proceeding to stop the role.
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateStopping to RoleStateStopped.

We can see the WaHostBootstrapper process starting (PID 2916, which matches the PID portion of the {PID:TID} prefix we see in WaHostBootstrapper.log – {00002916:00001744}). Then we see the role status change to Busy, then Unhealthy, then Busy again, which is exactly what we saw in the host bootstrapper log file. Then the role process exits and the guest agent proceeds to do a normal stop role followed by a start role. So there is nothing useful in this log.

Debugging

At this point we have looked at all of the useful logs and have not found any indication of what the source of the problem might be. Now it is time to do a live debug session in order to find out why WaIISHost.exe is shutting down.

The easiest way to start debugging on an Azure VM is with AzureTools. You can learn more about AzureTools and how to download it from https://blogs.msdn.com/b/kwill/archive/2013/08/26/azuretools-the-diagnostic-utility-used-by-the-windows-azure-developer-support-team.aspx.

First we want to download AzureTools and then double-click the X64 Debuggers tool, which will download and install the Debugging Tools for Windows, including WinDBG.

[screenshot: AzureTools showing the X64 Debuggers download]

Now we have to get WinDBG attached to WaIISHost.exe. Typically when debugging a process you can just start WinDBG, go to File –> Attach to a Process, and select the process from the list, but in this case WaIISHost.exe is crashing immediately on startup so it won’t show up in the list of currently running processes. The typical way to attach to a process that is crashing on startup is to set the Image File Execution Options Debugger registry key so that WinDBG starts and attaches as soon as the process starts. Unfortunately this doesn’t work in an Azure VM (for various reasons), so we have to find another way to attach a debugger.

AzureTools includes an option under the Utils tab to attach a debugger to the startup of a process. Switch to the Utils tab, click Attach Debugger, select WaIISHost from the process list, then click Attach Debugger. You will see WaIISHost show up in the Currently Monitoring list. AzureTools will attach WinDBG (or whatever you specify in Debugger Location) to a monitored process the next time that process starts up. Note that AzureTools will only attach to the next instance of the target process that starts – if the process is currently running, AzureTools will ignore it.

[screenshot: AzureTools Utils tab with WaIISHost in the Currently Monitoring list]
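
Conceptually, the attach-on-startup trick boils down to polling for the target process and launching the debugger against its PID as soon as it appears. The sketch below is only a rough illustration of that idea, not AzureTools’ actual implementation, and the windbg.exe path is an assumption (AzureTools lets you configure the Debugger Location):

using System;
using System.Diagnostics;
using System.Threading;

class AttachDebuggerOnStart
{
    static void Main()
    {
        const string targetName = "WaIISHost";                   // process to monitor (no .exe extension)
        const string debuggerPath = @"D:\Debuggers\windbg.exe";  // assumed install location - adjust to yours

        Console.WriteLine("Waiting for {0}.exe to start...", targetName);

        while (true)
        {
            Process[] running = Process.GetProcessesByName(targetName);
            if (running.Length > 0)
            {
                // Attach WinDBG to the newly started process by PID (-p attaches to an existing process).
                int pid = running[0].Id;
                Process.Start(debuggerPath, "-p " + pid);
                Console.WriteLine("Attached debugger to PID {0}.", pid);
                break;
            }

            Thread.Sleep(100);  // poll quickly so we catch the process before it crashes
        }
    }
}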

Now we just wait for Azure to recycle the processes and start WaIISHost again. Once WaIISHost starts, AzureTools will attach WinDBG and you will see a screen like this:

[screenshot: WinDBG attached to WaIISHost.exe]

Debugging an application, especially using a tool like WinDBG, is oftentimes more art than science. There are lots of articles that talk about how to use WinDBG, but Tess’s Debugging Demos series is a great place to start. Typically in these role recycling scenarios where there is no indication of why the role host process is exiting (i.e. the event logs aren’t showing us an exception to look for) I will just hit ‘g’ to let the debugger go and see what happens when the process exits.

WinDBG produces lots of output, but here are the more interesting pieces of information:

Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALIZING
[00000704:00003424, INFO ] Initializing runtime.
Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALED RETURNED. HResult=0
Microsoft.WindowsAzure.ServiceRuntime Information: 101 : Role environment . INITIALIZED
ModLoad: 00000000`00bd0000 00000000`00bda000 E:\approot\bin\MissingDependency.dll
Microsoft.WindowsAzure.ServiceRuntime Critical: 201 : ModLoad: 000007ff`a7c00000 000007ff`a7d09000 D:\Windows\Microsoft.NET\Framework64\v4.0.30319\diasymreader.dll
Role entrypoint could not be created:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.Synchronization, Version=1.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91' or one of its dependencies. The system cannot find the file specified.
at MissingDependency.WebRole..ctor()
--- End of inner exception stack trace ---
at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandleInternal& ctor, Boolean& bNeedSecurityCheck)
at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
at System.RuntimeType.CreateInstanceDefaultCtor(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
at System.Activator.CreateInstance(Type type, Boolean nonPublic)
at System.Activator.CreateInstance(Type type)
at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)
at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CreateRoleEntryPoint(RoleType roleTypeEnum)
at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.InitializeRoleInternal(RoleType roleTypeEnum)

  • The first 4 lines tell us that the Azure ServiceRuntime was initialized successfully.
  • Line 5 shows that my role entry point (MissingDependency.dll - the WebRole.cs code such as the OnStart method) is loaded. This tells us that we are getting into custom code and the problem is probably not with Azure itself.
  • Line 6 is loading diasymreader.dll. This is the diagnostic symbol reader and you will see it loaded whenever a managed process throws a second chance exception. The fact that this comes shortly after loading my DLL tells me that it is probably something within my DLL that is causing a crash.
  • Line 7, “Role entrypoint could not be created:”, tells me that Azure (WaIISHost.exe) is trying to enumerate the types in the role entry point module (MissingDependency.dll) to find the class that inherits from RoleEntryPoint, so that it knows where to call the OnStart and Run methods, but that it failed for some reason (a simplified sketch of what the host is doing here follows this list).
  • The rest of the lines show the exception being raised which ultimately is causing the process to exit.
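
Simplified, what the host is doing at the “Role entrypoint could not be created” step is something along these lines (this is an illustration of the pattern, not the actual RoleEnvironment.GetRoleEntryPoint implementation):

using System;
using System.Linq;
using System.Reflection;
using Microsoft.WindowsAzure.ServiceRuntime;

static class EntryPointLoader
{
    // Illustration only: find the RoleEntryPoint-derived class in the role assembly and instantiate it.
    public static RoleEntryPoint CreateEntryPoint(Assembly roleAssembly)
    {
        Type entryPointType = roleAssembly.GetTypes()
            .Single(t => typeof(RoleEntryPoint).IsAssignableFrom(t) && !t.IsAbstract);

        // If the constructor (or a field initializer) pulls in an assembly that is not in the
        // package, this throws a TargetInvocationException wrapping the FileNotFoundException
        // seen in the debugger output above.
        return (RoleEntryPoint)Activator.CreateInstance(entryPointType);
    }
}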

The exception message and callstack tell us that WaIISHost.exe was not able to find the Microsoft.Synchronization DLL while loading the role entry point class and running the code in MissingDependency.WebRole..ctor().
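
For context, a role entry point that produces this failure looks roughly like the hypothetical sketch below (the real project’s code is not shown in this post; SyncOrchestrator simply stands in for whatever Microsoft.Synchronization type the role actually used). Because the field initializer runs as part of the constructor, the CLR has to load Microsoft.Synchronization.dll the moment WaIISHost.exe instantiates the class, and if that DLL was never copied into the package the constructor throws before OnStart is ever reached:

using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.Synchronization;   // hypothetical dependency that is missing from the deployed bin folder

namespace MissingDependency
{
    public class WebRole : RoleEntryPoint
    {
        // The field initializer forces Microsoft.Synchronization.dll to load as soon as
        // the WebRole constructor runs, i.e. when WaIISHost creates the role entry point.
        private SyncOrchestrator _orchestrator = new SyncOrchestrator();

        public override bool OnStart()
        {
            // Never reached if the constructor throws.
            return base.OnStart();
        }
    }
}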


IntelliTrace

The above shows how to do a live debug, which has some nice benefits: there is no need to redeploy in order to troubleshoot, so it can be much faster if you are experienced with debugging, and you are in a much richer debugging environment, which is often required for the most complex problem types. But for issues such as role recycles it is often easier to turn on IntelliTrace and redeploy. For more information about setting up and using IntelliTrace see https://msdn.microsoft.com/en-us/library/windowsazure/ff683671.aspx or https://blogs.msdn.com/b/jnak/archive/2010/06/07/using-intellitrace-to-debug-windows-azure-cloud-services.aspx.

For this particular issue I redeployed the application with IntelliTrace turned on and was quickly able to get to the root cause.

[screenshots: IntelliTrace output showing the root cause]

Solution

Typically once I think I have found the root cause of a problem I like to validate the fix directly within the VM before spending the time to fix the problem in the project and redeploy. This is especially valuable if there are multiple things wrong (e.g. multiple dependent DLLs that are missing) so that you don’t spend a couple of hours in a fix/redeploy cycle. See https://blogs.msdn.com/b/kwill/archive/2013/09/05/how-to-modify-a-running-azure-service.aspx for more information about making changes to a running Azure service.

Applying the temporary fix:

  1. On your development machine, check the reference properties in Visual Studio to see where the DLL is located.
  2. Copy that DLL to the Azure VM into the same folder as the role entry point DLL (e:\approot\bin\MissingDependency.dll in this case).
  3. On the Azure VM, close WinDBG in order to let WaIISHost.exe finish shutting down, which will then let Azure recycle the host processes and attempt to restart WaIISHost.

Validating the fix:

  • The easiest way to validate the fix is to just watch Task Manager to see if WaIISHost.exe starts and stays running.
  • You should also validate that the role reaches the Ready state. You can do this in three different ways:
    • Check the portal. This may take a couple minutes for the HTML portal to reflect the current status.
    • Open C:\Logs\WaAppAgent.log and scroll to the end. You are looking for “reporting state Ready.” (a small sketch for checking this without scrolling follows this list)
    • Within AzureTools, download the DebugView.zip tool. Run DebugView.exe and check Capture –> Capture Global Win32. You will now see the results of the app agent heartbeat checks in real time.
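
For the WaAppAgent.log option, a small sketch like this prints the most recent state lines without scrolling through the whole file (the path and search string are as described in the list above; the file is opened with shared access because the guest agent keeps it open):

using System;
using System.IO;
using System.Linq;

class CheckRoleState
{
    static void Main()
    {
        // Open WaAppAgent.log with shared read/write access since the guest agent is still writing to it.
        using (var stream = new FileStream(@"C:\Logs\WaAppAgent.log",
            FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(stream))
        {
            string[] lines = reader.ReadToEnd()
                .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

            // Show the last few lines that report the role state; look for "reporting state Ready".
            foreach (string line in lines.Where(l => l.Contains("reporting state")).Reverse().Take(5).Reverse())
            {
                Console.WriteLine(line);
            }
        }
    }
}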

Applying the solution:

At this point we have validated that the only problem is the missing Microsoft.Synchronization.dll, so we can go to Visual Studio, set Copy Local to True on that reference, and redeploy.