One Instance of the Role Just Recycles After SDK Upgrade

I was recently working on a customer issue where one instance of the role, Instance 0, wouldn’t come up. On the Windows Azure Management Portal, all that could be seen was that Instance 0 kept recycling while the other instances were running fine.

The following message is perpetually displayed.

“Recycling (Role has encountered an error and has stopped. Sites were deployed [2013-11-29T12:51:30Z]”

On the new HTML portal it looks something like this.

On the portal we also see the following error message:

Your role instances have recycled a number of times during an update or upgrade operation. This indicates that the new version of your service or the configuration settings you provided when configuring the service prevent the role instances from running. Verify your code does not throw unhandled exceptions and that your configuration settings are correct and then start another update or upgrade operation. The long running operation tracking ID was c9a4ea6fO14a3ddlodl9lb77bsfbeedl.

So let me explain it with a bit of architecture thrown in. Below is the diagram (shamelessly plagiarized from Kevin Williamson’s blog) of the workflow of a service getting up and running. I highly recommend reading and understanding it.

 

(Courtesy: Kevin Williamson, Senior Escalation Engineer, Windows Azure)

The yellow part is more or less the virtual machine getting prepped and ready to host. That part succeeded, since the VM itself was up and we could remote in to it. So let’s see which of the downstream processes is causing the issue. I have used Process Explorer for similar troubleshooting in my previous blogs here. This time I used the techniques enumerated in this blog by Kevin, with WinDbg as the debugger.

Since one instance was recycling while the other was running, we connected over RDP to see the behavior locally and compared the working and non-working instances. On the working one, Instance 1, the operating system was Windows 2008 R2, while on the non-working one, Instance 0, it was Windows 2012 R2. So it seems the redeployment of the code did take place on Instance_0, but the instance was not reporting to RDFE as up and running (refer to the image above).

So I used Task Manager to observe the behavior of the different processes shown in the architecture diagram above. I noticed that the W3WP process disappeared from Task Manager on the Windows 2012 R2 virtual machine. I also observed WaIISHost coming up and vanishing very quickly. Based on my experience, I knew it was crashing (please refer to my previous posts, blog1 and blog2, on how I gained this experience). Event Viewer didn't have any error registered, so live debugging was required.
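The Task Manager observation can also be approximated with a script. The following is only a minimal sketch, not the method used in this investigation: it assumes process snapshots taken with `tasklist /FO CSV /NH` on the role VM, and the helper names (`parse_tasklist_csv`, `count_restarts`, `take_snapshot`) are hypothetical. A process that keeps reappearing under new PIDs is a strong hint that it is crashing and being relaunched.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(text):
    """Map image name (lowercased) to the set of PIDs found in
    `tasklist /FO CSV /NH` output."""
    pids = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank or malformed lines (second column must be the PID).
        if len(row) < 2 or not row[1].strip().isdigit():
            continue
        pids.setdefault(row[0].lower(), set()).add(int(row[1]))
    return pids


def count_restarts(snapshots, image_name):
    """Count how often image_name shows up under a PID that was not seen
    in any earlier snapshot, ignoring its very first appearance."""
    seen = set()
    restarts = 0
    for snap in snapshots:
        current = snap.get(image_name.lower(), set())
        new = current - seen
        if seen and new:
            restarts += len(new)
        seen |= current
    return restarts


def take_snapshot():
    """Capture one live snapshot (Windows only; not called on import)."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_tasklist_csv(out)
```

In practice one would call `take_snapshot()` every second or so for a minute and pass the collected list to `count_restarts(snapshots, "WaIISHost.exe")`; a count that keeps growing matches the "coming up and vanishing" pattern described above.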

So I used the AzureTools utility, developed in house by my team, to attach a debugger to the WaIISHost process. Then I manually killed the WaHostBootstrapper process, since WaAppAgent will restart WaHostBootstrapper, and WaHostBootstrapper will in turn start WaIISHost. Since the debugger is attached to WaIISHost, it breaks in the moment the process is launched.

Once it broke into the debugger, I was able to find the exception very quickly. I saw the following error:

(bf8.960): CLR exception - code e0434352 (first chance)

Microsoft.WindowsAzure.ServiceRuntime Critical: 201 : ModLoad: 00007ffe`d5300000 00007ffe`d5406000 D:\Windows\Microsoft.NET\Framework64\v4.0.30319\diasymreader.dll

Role entrypoint could not be created:

System.TypeLoadException: Unable to load the role entry point due to the following exceptions:--
System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.WindowsAzure.ServiceRuntime, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The system cannot find the file specified.

File name: 'Microsoft.WindowsAzure.ServiceRuntime, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'

WRN: Assembly binding logging is turned OFF.

To enable assembly bind failure logging, set the registry value [HKLM\Software\Microsoft\Fusion!EnableLog] (DWORD) to 1.

Note: There is some performance penalty associated with assembly bind failure logging.

To turn this feature off, remove the registry value [HKLM\Software\Microsoft\Fusion!EnableLog].

 --->
System.Reflection.ReflectionTypeLoadException: Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information.

 at System.Reflection.RuntimeModule.GetTypes(RuntimeModule module)

 at System.Reflection.RuntimeModule.GetTypes()

 at System.Reflection.Assembly.GetTypes()

 at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)

   --- End of inner exception stack trace ---

 at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)

 at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CreateRoleEntryPoint(RoleType roleTypeEnum)

 at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.InitializeRoleInternal(RoleType roleTypeEnum)

ModLoad: 00007ffe`e5ec0000 00007ffe`e5f64000 D:\Windows\SYSTEM32\clbcatq.dll

ntdll!NtTerminateProcess+0xa: 00007ffe`e6af683a c3 ret

For OS family 4 (Windows 2012 R2):

Image path : D:\Windows\Microsoft.Net\assembly\GAC_MSIL\Microsoft.WindowsAzure.ServiceRuntime\v4.0_2.2.0.0__31bf3856ad364e35\Microsoft.WindowsAzure.ServiceRuntime.dll

Image name : Microsoft.WindowsAzure.ServiceRuntime.dll

Version : 2.2
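The two pieces of evidence above, the version requested in the exception and the version in the GAC folder name, can be compared mechanically. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the GAC pattern assumes the .NET 4 folder layout `v4.0_<version>_<culture>_<publicKeyToken>` seen in the image path (the culture part is empty for a neutral assembly).

```python
import re


def requested_version(message):
    """Extract (assembly name, version) from the first assembly display
    name quoted in a load-failure message, or None if there is none."""
    m = re.search(r"'([\w.]+),\s*Version=(\d+(?:\.\d+){3})", message)
    return (m.group(1), m.group(2)) if m else None


def gac_version(path):
    """Extract the assembly version from a .NET 4 GAC folder name in a
    path, or None if the path does not contain one."""
    m = re.search(r"v4\.0_(\d+(?:\.\d+){3})_[^_\\]*_[0-9a-fA-F]{16}", path)
    return m.group(1) if m else None


def is_version_mismatch(message, path):
    """True when the version the role asked for differs from the one
    installed in the GAC."""
    wanted = requested_version(message)
    installed = gac_version(path)
    return wanted is not None and installed is not None and wanted[1] != installed


error = ("Could not load file or assembly "
         "'Microsoft.WindowsAzure.ServiceRuntime, Version=2.0.0.0, "
         "Culture=neutral,PublicKeyToken=31bf3856ad364e35' "
         "or one of its dependencies.")
image_path = (r"D:\Windows\Microsoft.Net\assembly\GAC_MSIL"
              r"\Microsoft.WindowsAzure.ServiceRuntime"
              r"\v4.0_2.2.0.0__31bf3856ad364e35"
              r"\Microsoft.WindowsAzure.ServiceRuntime.dll")
```

With the strings from this case, `requested_version(error)` gives version `2.0.0.0`, `gac_version(image_path)` gives `2.2.0.0`, and `is_version_mismatch(error, image_path)` is `True`.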

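The role was still asking for version 2.0.0.0 of Microsoft.WindowsAzure.ServiceRuntime, while the OS family 4 image only had version 2.2.0.0 in the GAC. Hence I concluded that the project had not been properly upgraded for SDK 2.2 and OS family 4. We went to Project Properties and chose Upgrade to SDK 2.2. This took care of upgrading all related binaries to SDK v2.2.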
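To double-check that no project in the solution still references the old runtime after such an upgrade, a quick scan of the project files can help. The sketch below is an illustration only, not part of the fix described here: it assumes classic MSBuild `.csproj` files in which the reference carries its version in the `Include` attribute, and `stale_references` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Default XML namespace used by classic (non-SDK-style) MSBuild project files.
MSBUILD_NS = "{http://schemas.microsoft.com/developer/msbuild/2003}"


def stale_references(csproj_xml, assembly, expected_version):
    """Return the versions of `assembly` referenced in the project XML
    that differ from expected_version."""
    root = ET.fromstring(csproj_xml)
    stale = []
    for ref in root.iter(MSBUILD_NS + "Reference"):
        # Include looks like "Name, Version=..., Culture=..., PublicKeyToken=..."
        parts = [p.strip() for p in ref.get("Include", "").split(",")]
        if parts[0] != assembly:
            continue
        fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        version = fields.get("Version")
        if version is not None and version != expected_version:
            stale.append(version)
    return stale


sample_project = """<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup>
    <Reference Include="Microsoft.WindowsAzure.ServiceRuntime, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35, processorArchitecture=MSIL" />
    <Reference Include="System" />
  </ItemGroup>
</Project>"""
```

For the sample project above, `stale_references(sample_project, "Microsoft.WindowsAzure.ServiceRuntime", "2.2.0.0")` returns `["2.0.0.0"]`; an empty list would mean every versioned reference already points at 2.2.0.0.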

 

Once the project was upgraded, it was redeployed, and this time the role came up fine and didn’t go into the recycling state. Both Instance_0 and Instance_1 also got the new version of the code.

 

Angshuman Nayak, Cloud Integration Engineer