Problem on Updating ASP.NET Web Content in Failover Cluster

 

ASP/ASP.NET allows you to change web site content while the site is running on IIS. Recently, however, we found a problem with this in an IIS failover cluster.

Issue

The environment is a failover cluster of two IIS 6 servers hosting a test web site that contains a simple HTML page and an ASPX page. The web site content is placed on a disk resource in the failover cluster.

The pages have exactly the same content: a single line of text, “ABC”.

When we open IE on a third machine and navigate to these pages using the cluster name in the address bar (such as https://clustername/test.htm), “ABC” is displayed. Note that we are actually accessing the IIS site on the active node (let’s call it Node A).

Now we force the cluster to fail over to Node B. OK, it is time to use Notepad to modify the pages and replace “ABC” with “123”. Done? This time IE will show “123”. Right?

The problem only appears when we fail over to Node A again. What do you see now when IE refreshes?

On my test box, the HTML page displays “123” as expected, but the ASPX page still displays “ABC”. Why?

Analysis

The first thing I tried was checking the files on the disk resource. Strangely, the files were up to date, so it looked like a cache issue. I then reset IIS by running “iisreset” at a command prompt, and this time IE showed the right page.

So why does ASP.NET not pick up the changes here, when it works fine in a single-server environment (and, in fact, in an NLB cluster too)? We employed a lot of tools and finally narrowed the problem down to the File Change Notification (FCN) mechanism, which ASP.NET uses to monitor changes to web content.

When we access an ASP.NET site on one node, a w3wp.exe process is launched to host ASP.NET, and ASP.NET automatically starts monitoring the web content. When a change notification arrives through FCN, w3wp.exe clears the ASP.NET cached content and uses the latest web content.

However, when a failover happens and the disk resource moves to another node, the FCN registration on the original node becomes invalid. As a result, even when that node becomes active again after another failover, IIS/ASP.NET has no way of knowing that the web content changed, and the old content is served until we restart IIS (or recycle the application pool).
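As a rough illustration of the lighter of those two options, an IIS 6 application pool can be recycled from a script through the IIS ADSI provider instead of resetting all of IIS. This is only a sketch: the pool name DefaultAppPool is an assumption, so substitute the pool that actually hosts the site.

```vbscript
' Sketch only: recycle one IIS 6 application pool on the local node.
' "DefaultAppPool" is an assumed pool name, not taken from the test setup.
Option Explicit
Const APP_POOL_NAME = "DefaultAppPool"
Dim appPool

' Bind to the application pool through the IIS ADSI provider
Set appPool = GetObject("IIS://localhost/W3SVC/AppPools/" & APP_POOL_NAME)

' Recycle the pool so that a fresh w3wp.exe starts from the current files
appPool.Recycle

WScript.Echo "Recycled application pool: " & APP_POOL_NAME
```

Saved as a .vbs file, it would be run with cscript by an administrator on the node that is serving stale content.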

Resolution

An interesting thing to notice is that you only run into this problem if failovers happen frequently. By default, IIS stops idle worker processes on the passive node after 20 minutes, and the ASP.NET cached content goes away with them, so new worker processes start from the current files.

The possible workarounds are as follows.

1. Reset IIS on Active Node After Failover

The disadvantage is that service is interrupted.

 

2. Reset IIS on Passive Node Before Failover

This has a smaller impact on service, because the passive node is not serving incoming requests when it is reset. However, someone has to perform the reset manually, which is not optimal.
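The manual step could at least be reduced to running one script from an administrator's workstation. The sketch below assumes the passive node is reachable under the hypothetical computer name NODEB and that the account running it has administrative rights there. It passes the computer name to iisreset and reports the exit code without interpreting it.

```vbscript
' Sketch only: restart IIS on the passive node from another machine.
' "NODEB" is a hypothetical computer name for the passive node.
Option Explicit
Const PASSIVE_NODE = "NODEB"
Dim shell, exitCode

Set shell = CreateObject("WScript.Shell")

' Run iisreset against the remote node in a hidden window and wait for it to finish
exitCode = shell.Run("iisreset " & PASSIVE_NODE & " /restart", 0, True)

WScript.Echo "iisreset for " & PASSIVE_NODE & " finished with exit code " & exitCode
```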

 

3. Decrease Idle Time Allowed For Worker Process

If no w3wp processes are running on the passive node, the problem does not occur after failover. This gives us the third workaround: change the IIS 6 application pool settings so that IIS shuts down idle worker processes on the passive node sooner than the default (20 minutes). As long as the interval between failovers is longer than this idle timeout, the problem is prevented.

This workaround has a drawback: worker processes are shut down more often, which may not be optimal for ASP.NET applications that have to be recompiled at startup, since recompilation takes time.
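The idle timeout change described above can be sketched with the IIS ADSI provider, which exposes the pool's IdleTimeout metabase property (in minutes). The pool name DefaultAppPool and the value of 5 minutes are assumptions for illustration only; choose a value shorter than the expected interval between failovers.

```vbscript
' Sketch only: shorten the idle timeout of one IIS 6 application pool.
' "DefaultAppPool" and 5 minutes are assumed values for illustration.
Option Explicit
Const APP_POOL_NAME = "DefaultAppPool"
Const IDLE_MINUTES = 5
Dim appPool

' Bind to the application pool through the IIS ADSI provider
Set appPool = GetObject("IIS://localhost/W3SVC/AppPools/" & APP_POOL_NAME)

WScript.Echo "Current IdleTimeout (minutes): " & appPool.IdleTimeout

' Change the value and write it back to the metabase
appPool.IdleTimeout = IDLE_MINUTES
appPool.SetInfo

WScript.Echo "New IdleTimeout (minutes): " & appPool.IdleTimeout
```

Since either node can end up as the passive one, the script would presumably need to be run on both nodes.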

 

4. Use a Customized IIS Cluster Script

This is by far the most efficient way we found. The idea is to add a custom section to the standard IIS cluster script clusweb.vbs (the modifications are in the Offline function) that shuts down existing w3wp processes during failover, so that no w3wp.exe is left on the passive node.

 

'Sample code for IIS 6
Function Offline( )
    'Connect to WMI on the local computer
    strComputer = "."
    Set objWMIService = GetObject("winmgmts:" _
        & "{impersonationLevel=impersonate}!\\" _
        & strComputer & "\root\cimv2")
    'Find every IIS worker process and terminate it
    Set w3wpProcessList = objWMIService.ExecQuery _
        ("Select * from Win32_Process Where name = 'w3wp.exe'")
    For Each w3wpProcess In w3wpProcessList
        w3wpProcess.Terminate()
    Next
    'Report success to the cluster service
    Offline = True
End Function

'For IIS 7 there is another way
'APPLICATION_POOLS_SECTION_NAME, CONFIG_APPHOST_ROOT, APP_POOL_NAME and
'FindAppPoolIndex are not defined in this excerpt; they are expected to
'come from the rest of the cluster script.
Dim STOP_APP_POOL
STOP_APP_POOL = 1

'Stop the application pool for the website
Function StopAppPool()
    Dim ahwriter, appPoolsSection, appPoolsCollection, index, appPool, appPoolMethods, stopMethod, callStopMethod
    Set ahwriter = CreateObject("Microsoft.ApplicationHost.WritableAdminManager")
    Set appPoolsSection = ahwriter.GetAdminSection(APPLICATION_POOLS_SECTION_NAME, CONFIG_APPHOST_ROOT)
    Set appPoolsCollection = appPoolsSection.Collection
    index = FindAppPoolIndex(appPoolsCollection, APP_POOL_NAME)
    Set appPool = appPoolsCollection.Item(index)
    'See if it is already stopped (any state other than 1, Started, is treated as stopped)
    If appPool.GetPropertyByName("state").Value <> 1 Then
        StopAppPool = True
        Exit Function
    End If
    'Try to stop the application pool
    Set appPoolMethods = appPool.Methods
    Set stopMethod = appPoolMethods.Item(STOP_APP_POOL)
    Set callStopMethod = stopMethod.CreateInstance()
    callStopMethod.Execute()
    'If the pool is no longer started return True, otherwise return False
    If appPool.GetPropertyByName("state").Value <> 1 Then
        StopAppPool = True
    Else
        StopAppPool = False
    End If
End Function

Function Offline( )
    StopAppPool()
    Offline = True
End Function

Also note that an IIS NLB cluster does not experience this problem and has other advantages over a failover cluster, so using an NLB cluster is highly recommended for an HA setup (more information is provided in KB970759).

Last Question

Does this problem also apply to ASP pages? You can set up a test environment to have a look.

Regards,

Lex Li