Troubleshooting Scenario 3 – Role Stuck in Busy

Continuing from the troubleshooting series at Windows Azure PaaS Compute Diagnostics Data, this blog post will describe how to troubleshoot a role that is stuck in the Busy state with the error message “Busy (Waiting for role to start... Application startup tasks are running.)”.



You are deploying a service to Azure and your service never starts correctly and never gets to the Ready state.  It is stuck forever in the Busy state.



Get the Big Picture

Just like in Scenario 1 we want to RDP onto the Azure VM and get a high level overview of what is happening on the VM.  We see WaHostBootstrapper running, but we do not see WaWorkerHost.exe and watching task manager for a minute or two we don’t see any other changes.  From the notes in Scenario 1 we know that this probably means a problem with a startup task.



Check the logs

We know from Windows Azure PaaS Compute Diagnostics Data that the WaHostBootstrapper logs are in C:\Resources\WaHostBootstrapper.log.  Checking that folder we see only one WaHostBootstrapper.log file, and no instances of WaHostBootstarpper.log.old.<index>.  This means that WaHostBootstrapper has only tried to start one time and it isn’t recycling due to an error.

WaHostBootstrapper log:

I am going to skip over all of the stuff that we expect to see in every host bootstrapper log (see Scenario 2 for more info) and go straight to the bottom of the log because I am interested in whatever host bootstrapper is doing right now, not what it did in the past.  These are the last two lines in the file:

[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .

We can compare the timestamp in host bootstrapper (2013/09/05, 21:47:15.487) to the current system time (2013/09/05, 22:13:34.581) to see that the host bootstrapper has been trying to run E:\approot\StartupTasks\Startup.cmd for about 30 minutes.  Notice that the task type is 0 (“Executing Startup Task type=0”) which is a simple startup task which causes the host bootstrapper to wait until the task finishes before continuing execution (see the taskType entry at 

Note that we can also get this same information from the event viewer.  We can see that the last thing that Azure is doing is trying to run the startup task:



Check the startup task and test the fix

Now that we know where we are failing lets go check the startup task.  Here are the contents of e:\approot\StartupTasks\Startup.cmd:

ECHO Press Enter When Ready
REM Other commands here...

This one is pretty simple and obvious to tell what is wrong, but if this were more complicated logic then we could run the startup task manually, attach a debugger to it, etc.  Now it would be nice to test a quick fix to the startup task to make sure it fixes the problem without having to completely redeploy the solution.  Using the information at How to Modify a Running Azure Service we know that we can terminate WaHostBootstrapper, replace the Startup.cmd, and then wait.

I modified Startup.cmd to remove the PAUSE line, and terminated WaHostBootstrapper, and waited for a minute, and now I see WaWorkerHost.exe running:


And the WaHostBootstrapper.log shows that my startup task finished correctly and the role host process (WaWorkerHost) was started:

[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Program "E:\approot\StartupTasks\Startup.cmd"  exited with 0. Working Directory = E:\approot\
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetRoleHostDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Executing base\x64\WaWorkerHost.exe .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Role host process PID: 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Registering client with PID 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client WaWorkerHost.exe (4036) registered.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client process 4036 is the role host.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Role host process registered.

Comments (6)

  1. andrew says:

    This is terrible.  The whole point of PAAS is that you don't have to screw around on the machines.  Why doesn't Visual Studio tell you what the @#$%ing problem is instead of displaying the wonderful message

    Your role instances have recycled a number of times during an update or upgrade operation. This indicates that the new version of your service or the configuration settings you provided when configuring the service prevent the role instances from running. Verify your code does not throw unhandled exceptions and that your configuration settings are correct and then start another update or upgrade operation.

  2. Andrew, you are 100% correct.  There have been significant improvements in this area since Azure was released and these types of difficult to troubleshoot problems are far less frequent today than they used to be.  We have also identified several more common failure paths and have some nice improvements in the release pipeline that should further help with this scenario.

  3. Ismail Ahmed Syed says:

    This is a very nice post especially the troubleshooting section , it has helped  me a lot .Thanks Kevin

    Also kevin,  as part of startup tasks does cloud services connect to db also?

  4. Thanks for the feedback Ismail.

    I am not sure what you mean when you ask if a cloud service connects to db in a startup task?  Your startup task can be anything you want it to be (batch file, powershell, custom EXE, etc), and within the startup task you can do anything you want to do (connect to a DB, download files, install MSIs, etc).

  5. Corey says:

    I agree with Andrew, this is ridiculous. The portal or studio should report the problem, what you have presented here is unacceptable.

  6. Linda says:

    Which part of the solution (which files) might be "throwing unhandled exceptions"?


Skip to main content