Optimal Azure Restarts

Restarts for Web Roles

Updated 17 Jan 2013: Tracing in OnStop is not supported.

An often neglected consideration in Windows Azure is how to handle restarts. It’s important to handle restarts correctly so you don’t lose or corrupt your persisted data, and so you can shut down quickly, restart, and efficiently handle new requests. Windows Azure Cloud Service applications are restarted approximately twice per month for operating system updates. (For more information on OS updates, see Role Instance Restarts Due to OS Upgrades.)

When a web application is going to be shut down, the RoleEnvironment.Stopping event is raised. The web role boilerplate created by Visual Studio does not override the OnStop method, so the application has only a few seconds to finish processing HTTP requests before it is shut down. If your web role is busy with pending requests, some of those requests can be lost.

You can delay the restart of your web role by up to 5 minutes by overriding the OnStop method and calling Sleep, but that’s far from optimal. Once the Stopping event is raised, the load balancer (LB) stops sending requests to the web role, so delaying the restart for longer than it takes to process pending requests leaves your virtual machine spinning in Sleep, doing no useful work. The optimal approach is to wait in the OnStop method until there are no more requests, and then initiate the shutdown. The sooner you shut down, the sooner the VM can restart and begin processing requests. To implement the optimal shutdown strategy, add the following code to your WebRole class.

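public override void OnStop()
{
    Trace.TraceInformation("OnStop called WebRole");
    // "Requests Current" reports the requests ASP.NET is currently handling.
    var pcrc = new PerformanceCounter("ASP.NET", "Requests Current", "");

    while (true)
    {
        var rc = pcrc.NextValue();
        Trace.TraceInformation("ASP.NET Requests Current = " + rc.ToString());
        // No requests left: return from OnStop, which initiates shutdown.
        if (rc <= 0)
            break;
        // Requests are still in flight: check again in one second.
        System.Threading.Thread.Sleep(1000);
    }
}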

The code above checks the ASP.NET Requests Current performance counter. As long as there are requests, the OnStop method calls Sleep to delay the shutdown. Once the Requests Current counter drops to zero, OnStop returns, which initiates shutdown. Should the web server be so busy that the pending requests cannot be completed in 5 minutes, the application is shut down anyway. Remember that once the Stopping event is raised, the LB stops sending requests to the web role, so unless your web role is massively undersized (or has too few instances), you should never need more than a few seconds to complete the current requests.
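If you'd rather not lean on the platform's 5 minute cutoff, one option is to put your own deadline on the wait. The following is only a sketch of that idea, not part of the original sample: ShutdownHelper, WaitForDrain, and the 4 minute figure in the usage comment are names and values I made up for illustration. The counter read is passed in as a delegate so the loop can be exercised without a real performance counter.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not part of the Azure SDK).
public static class ShutdownHelper
{
    // Polls readRequestsCurrent until it reports no requests or the timeout elapses.
    // Returns true if the requests drained, false if we stopped waiting first.
    public static bool WaitForDrain(Func<float> readRequestsCurrent,
                                    TimeSpan timeout,
                                    TimeSpan pollInterval)
    {
        var elapsed = Stopwatch.StartNew();
        while (readRequestsCurrent() > 0)
        {
            if (elapsed.Elapsed >= timeout)
                return false;
            Thread.Sleep(pollInterval);
        }
        return true;
    }
}

// Possible use inside WebRole.OnStop (the 4 minute deadline is an assumption):
//
//   using (var pcrc = new PerformanceCounter("ASP.NET", "Requests Current", ""))
//   {
//       ShutdownHelper.WaitForDrain(() => pcrc.NextValue(),
//                                   TimeSpan.FromMinutes(4),
//                                   TimeSpan.FromSeconds(1));
//   }
```

The return value tells you whether the role drained or gave up, which is handy if you want to record that somewhere other than Trace.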

The code above writes Trace data, but unless you perform a tricky On-Demand Transfer, the trace data from the OnStop method will never appear in WADLogsTable. Later in this post I’ll show how you can use DebugView to see these trace events. I’ll also show how you can get tracing working in the web role’s OnStart method.

Optimal Restarts for Worker Roles

Handling the Stopping event in a worker role requires a different approach. Typically the worker role processes queue messages in the Run method. The strategy involves two flags, declared as volatile fields because they are shared between threads: one to notify the Run method that the Stopping event has been raised, and another to notify the OnStop method that it’s safe to initiate shutdown. (Shutdown is initiated by returning from OnStop.) The following code demonstrates the two-flag approach.

public class WorkerRole : RoleEntryPoint
{
    private CloudQueue myWorkQueue;
    private volatile bool onStopCalled = false;
    private volatile bool returnedFromRunMethod = false;

    public override void Run()
    {
        CloudQueueMessage msg = null;
        while (true)
        {
            try
            {
                // If OnStop has been called, return to do a graceful shutdown.
                if (onStopCalled == true)
                {
                    Trace.TraceInformation("onStopCalled WorkerRole");
                    returnedFromRunMethod = true;
                    return;
                }
                // Retrieve and process a new message.
                msg = myWorkQueue.GetMessage();
                if (msg != null)
                {
                    ProcessQueueMessage(msg);
                }
                else
                {
                    System.Threading.Thread.Sleep(1000);
                }
            }
            catch (Exception ex)
            {
                string err = ex.Message;
                if (ex.InnerException != null)
                    err += " Inner Exception: " + ex.InnerException.Message;
                Trace.TraceError(err);
                // Don't fill up Trace storage if we have a bug in  process loop.
                System.Threading.Thread.Sleep(1000);
            }
        }
    }

    public override void OnStop()
    {
        onStopCalled = true;
        Trace.TraceInformation("OnStop called from Worker Role.");
        while (returnedFromRunMethod == false)
        {
            Trace.TraceInformation("Waiting for returnedFromRunMethod");
            System.Threading.Thread.Sleep(1000);
        }
        Trace.TraceInformation("returnedFromRunMethod is true, so restarting");
    }

    private void ProcessQueueMessage(CloudQueueMessage msg)
    {
        // Code omitted for clarity.
        System.Threading.Thread.Sleep(1000);
    }

    private void ConfigDiagnostics()
    {
        // See https://bit.ly/UXM44C
    }

    public override bool OnStart()
    {
        // Code omitted for clarity.
        return base.OnStart();
    }
}

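When OnStop is called, the onStopCalled flag is set to true, which signals the code in the Run method to shut down at the top of the loop, when no queue message is being processed. Run then sets returnedFromRunMethod to true and returns, which lets the wait loop in OnStop finish and return.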
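The same handshake can also be written with a CancellationTokenSource and a ManualResetEventSlim instead of two volatile flags. To be clear, this is a different technique from the sample above, shown only as a sketch: DrainableWorker and Processed are hypothetical names, and a ConcurrentQueue stands in for the CloudQueue so the class runs without the Azure SDK.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for a worker role; not derived from RoleEntryPoint.
public class DrainableWorker
{
    private readonly ConcurrentQueue<string> workQueue;   // stand-in for CloudQueue
    private readonly CancellationTokenSource stopping = new CancellationTokenSource();
    private readonly ManualResetEventSlim runExited = new ManualResetEventSlim(false);

    public int Processed { get; private set; }

    public DrainableWorker(ConcurrentQueue<string> workQueue)
    {
        this.workQueue = workQueue;
    }

    // Plays the part of Run: loop until a stop has been requested.
    public void Run()
    {
        try
        {
            while (!stopping.IsCancellationRequested)
            {
                string msg;
                if (workQueue.TryDequeue(out msg))
                {
                    Processed++;   // real message handling would go here
                }
                else
                {
                    // Idle wait that wakes up early if a stop is requested.
                    stopping.Token.WaitHandle.WaitOne(TimeSpan.FromSeconds(1));
                }
            }
        }
        finally
        {
            // Tell OnStop that Run is done, even if Run exits via an exception.
            runExited.Set();
        }
    }

    // Plays the part of OnStop: request the stop, then block until Run has exited.
    public void OnStop()
    {
        stopping.Cancel();
        runExited.Wait();
    }
}
```

The stop check still happens only at the top of the loop, between messages, so a message that is being processed is never interrupted. The difference is that OnStop blocks on an event rather than polling once a second, and an idle Run wakes up as soon as the stop is requested.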

Viewing OnStop Trace Data

As mentioned previously, unless you perform a tricky On-Demand Transfer, the trace data from the OnStop method will never appear in WADLogsTable. We’ll use DebugView to see these trace events. In Solution Explorer, right-click the cloud project and select Publish.

Download your publish profile. In the Publish Windows Azure Application dialog box, select Debug and select Enable Remote Desktop for all roles.

Trace calls are compiled into the assembly only when the TRACE symbol is defined for the build, so set the build configuration to Debug (where Visual Studio defines it) to see the Trace data. Once the application is published and running, in Visual Studio, select Server Explorer (Ctrl+Alt+S). Select Windows Azure Compute, and then select your cloud deployment. (In this case it’s called t6 and it’s a production deployment.) Select the web role instance, right-click, and select Connect using Remote Desktop.

Remote Desktop Connection (RDC) will use the account name you specified in the publish wizard and prompt you for the password you entered. On the left side of the taskbar, select the Server Manager icon.

In the left tab of Server Manager, select Local Server, and then select IE Enhanced Security Configuration (IE ESC). Select the off radio button in the IE ESC dialog box.

Start Internet Explorer, then download and install DebugView. Start DebugView, and in the Capture menu, select Capture Global Win32.

Select the filter icon, and then enter the following exclude filter:

Heartbeat;*is reporting state Ready.;*has current state Started*;*WaIISHost.exe Information:*;*Microsoft.WindowsAzure.ServiceRuntime Information:*;*Invalid parameter passed *;*w3wp.exe Information:*;

 

For this test, I added a call to the RoleEnvironment.RequestRecycle method in the About action method. As the name suggests, it initiates the shutdown/restart sequence. Alternatively, you can publish the application again, which will also initiate the shutdown/restart sequence.

Follow the same procedure to view the trace data in the worker role VM. Select the worker role instance, right-click and select Connect using Remote Desktop.

Follow the procedure above to disable IE Enhanced Security Configuration. Install and configure DebugView using the instructions above. I use the following filter for worker roles:

Heartbeat;*is reporting state Ready.;*has current state Started, desired *;*w3wp.exe Information: *;*WaWorkerHost.exe Information*;*Invalid parameter passed *;*Microsoft.WindowsAzure.ServiceRuntime Information:*;

For this sample, I published the Azure package, which causes the shutdown/restart procedure.

One last parting tip: to get tracing working in the web role’s OnStart method, add the following:

 Trace.Listeners.Add(
   new Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitorTraceListener());

Tracing in OnStop

Tracing in the OnStop method is not supported and not recommended for the following reasons:

  1. The OnStop method should be used only to delay shutdown until you’ve processed all pending requests (web role) or completed a unit of work (worker role) so you get a clean shutdown. As stated above, you should return from OnStop (which initiates shutdown) as soon as your application is in a clean shutdown state. The OnStop method should be very simple, so you shouldn’t need to trace in this method.
  2. OnStop is called from a different process, which doesn’t read the trace listener configuration in the app.config (worker role) or web.config (web role) file. To get it working you’d have to configure the Azure Diagnostics trace listener programmatically to hook up the listener to the agent process that manages the PaaS VM role.
  3. An On-Demand Transfer has inherent polling wait times of a minute or more, so it is not well suited for execution during the time-limited OnStop.

Most of the information in this post comes from the Azure multi-tier tutorial Tom and I published last week. Be sure to check it out for lots of other good tips.

Follow me (@RickAndMSFT) on Twitter, where I have a no-spam guarantee of quality tweets.