A boring, but typical, difficult case: ASP.NET session lost

Problem Description

It was a large ASP.NET project. In the test environment, everything was fine. In the production environment, when the load went up, an unhandled NullReferenceException occurred at random. Code analysis showed that it was thrown when accessing a Session object that should already have been set elsewhere. The issue occurred only 3 times in the past half year, on 3 different pages.

The Strategy

If the problem had occurred more frequently, we might have had a chance to find a common pattern, such as a particular user action or a particular data input. But it occurred only 3 times, leaving almost no chance for further testing. We had to make a comprehensive and solid action plan so that we would capture enough information when the problem occurred again.

The way to think is quite straightforward: learn how session state works in detail, list all the possibilities, and capture information accordingly.

For details on how Session works, please refer to:

Underpinnings of the Session State Implementation in ASP.NET

https://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/ASPNetSessionState.asp

With a full understanding of ASP.NET session state, the issue should fall into one of the following possibilities:

1. It is a load-balanced environment, but the application uses In-Proc session, or the machine keys are not synced across the servers. The customer confirmed there was only one server, with In-Proc session, so this was not the case.

2. The simplest situation is that all users’ sessions get lost when the problem happens. This is usually caused by an appDomain recycle or an IIS crash. We can check Performance Monitor (the appDomain restart counter) or the Event Log (for crashes) to verify.

3. It is more difficult if the problem impacts only some users, not all. To verify with a log, in the Session_Start event we can save the session creation time and the current session ID to a local log file. We also save a test value into the session. When the problem occurs, in the ASP.NET global error handler, we get the session ID of the current user and compare it with the local log. If the session was just created, it is likely a client-side problem such as a lost cookie.

4. If the session was not just created, then let’s check whether the test value we saved in Session_Start is lost as well. If it is, we can be sure that some server-side code cleared the whole session, probably with Session.Clear.

5. If the session was not just created and the test value is still there, the issue must be caused by whatever code last modified the session value that is now null and triggers the NullReferenceException. In that situation, we have to log all server-side operations on that session object to trace it.
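The checks in items 3 to 5 amount to a small decision procedure. The following is a minimal, language-neutral sketch in Python, not the code used in the project. The names SENTINEL_KEY, NEW_SESSION_WINDOW, creation_log, on_session_start and triage are all hypothetical, the 30-second threshold for "just created" is an assumption, and a plain dictionary stands in for both the session and the local log file. In the ASP.NET application the first function would correspond to Session_Start and the second to the global error handler.

```python
from datetime import datetime, timedelta

# Hypothetical names: none of these come from the original project.
SENTINEL_KEY = "__session_probe"            # the "test value" saved at session start
NEW_SESSION_WINDOW = timedelta(seconds=30)  # assumed threshold for "just created"

creation_log = {}  # session id -> creation time (a local log file in practice)


def on_session_start(session_id, session, now):
    """Role of Session_Start: record the creation time and store a test value."""
    creation_log[session_id] = now
    session[SENTINEL_KEY] = now.isoformat()


def triage(session_id, session, now):
    """Role of the global error handler: classify the failure by what survived."""
    created = creation_log.get(session_id)
    if created is None:
        # No creation record at all, so the log cannot tell us anything.
        return "unknown: no creation record for this session id"
    if now - created <= NEW_SESSION_WINDOW:
        return "new session: suspect the client side, such as a lost cookie"
    if SENTINEL_KEY not in session:
        return "whole session cleared: look for server code such as Session.Clear"
    return "single value lost: trace every write to that session value"
```

Passing the current time in as an argument, rather than reading the clock inside the functions, keeps the sketch easy to exercise with fixed timestamps.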

As shown above, the different behaviors are tightly bound to different potential causes. In summary, we need to figure out which of the situations below the problem belongs to:

1. All the session values get lost for all the users.

2. All the session values get lost for a specific user.

3. Some session values get lost for a specific user, while other session values are still there.

Then we can decide the next step to narrow it down further, like:

1. Check for an appDomain recycle or a crash for situation 1.

2. Check the session creation time for situation 2.

3. Search all the code for Session.Clear calls, or use a debugger to set a conditional breakpoint during execution, for situation 3.
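The code search in step 3 is easy to automate. Below is a minimal sketch in Python, assuming the source tree is available on disk. The function name find_clearing_calls and the list of file extensions are hypothetical. Only Session.Clear is named in the text above; Abandon, RemoveAll and Remove are other HttpSessionState methods that drop values, and including them is my own assumption about what is worth flagging.

```python
import re
from pathlib import Path

# Session.Clear comes from the text; the other three method names are an
# assumption about which calls are also worth reviewing.
CLEARING_CALL = re.compile(r"\bSession\s*\.\s*(Clear|Abandon|RemoveAll|Remove)\s*\(")

# Assumed file extensions for an ASP.NET code base.
SOURCE_SUFFIXES = {".cs", ".vb", ".aspx", ".ascx", ".asax"}


def find_clearing_calls(root):
    """Return (path, line number, line text) for each matching call under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if CLEARING_CALL.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

A text search like this only finds direct calls. Values that are set to null or removed through a helper would still need the logging or the conditional breakpoint mentioned above.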

Action and Root Cause

Based on the above analysis, we started working on the code to implement the logging mechanism. The customer also added further logging to trace some important function invocations.

After a week of waiting, the problem reproduced. In the log file, we found:

1. Some session values were lost for a specific user, while other session values were still there.

2. Based on the session creation time, the session had been alive for a long while.

3. From the execution trace in the log file, the problematic session value was indeed cleared by server-side code. By design, after a workflow cycle, the session values are cleared in a central place. After that, the page is redirected to a special page that does the initialization job for the next workflow cycle. The log showed only the clearance job clearing the session, with no initialization job resetting it afterwards.
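A trace that shows a clear without a following initialization can come from routing every session write through one logged wrapper. The sketch below illustrates that idea in Python and is not the logging code the customer wrote. TracedSession and the logger name "session.trace" are hypothetical, and an in-memory dictionary stands in for the real session store.

```python
import logging
from collections.abc import MutableMapping

log = logging.getLogger("session.trace")  # hypothetical logger name


class TracedSession(MutableMapping):
    """Dictionary-like session store that logs every write, removal and clear."""

    def __init__(self, session_id):
        self.session_id = session_id
        self._values = {}

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        log.info("session=%s set %s", self.session_id, key)
        self._values[key] = value

    def __delitem__(self, key):
        log.info("session=%s remove %s", self.session_id, key)
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)

    def clear(self):
        # Log the clear as one event instead of one removal per key.
        log.info("session=%s clear (%d keys)", self.session_id, len(self._values))
        self._values.clear()
```

With such a wrapper, a "clear" entry that is never followed by "set" entries for the same session ID is exactly the pattern described in item 3.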

With the above information, we started a careful code review of the place where the two steps join. I found that the redirection was performed by client-side JavaScript. This is a typical fault in web applications. Since JavaScript depends heavily on client browser behavior, it is better to use an HTTP 302 redirect than client script. It is also a must to verify, in server-side code, everything that relies on the client.
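The two recommendations can be sketched as follows, in Python and under stated assumptions rather than as the actual fix. INIT_PAGE, finish_workflow_cycle, initialize_workflow, require and the "step" key are all hypothetical. The redirect is expressed as a status code plus a Location header produced by the server, which is what ASP.NET's Response.Redirect sends, and the last function shows a server-side check that fails with a clear message when a value the page depends on is missing.

```python
from http import HTTPStatus

INIT_PAGE = "/workflow/init"  # hypothetical URL of the initialization page


def finish_workflow_cycle(session):
    """End of a cycle: clear the session and redirect with a server-issued 302."""
    session.clear()
    # The redirect is part of the HTTP response itself, not a script that the
    # browser may or may not run.
    return HTTPStatus.FOUND, {"Location": INIT_PAGE}


def initialize_workflow(session):
    """Handler for INIT_PAGE: seed the values the next cycle needs."""
    session["step"] = 0  # hypothetical initial value


def require(session, key):
    """Server-side check: fail loudly instead of dereferencing a missing value."""
    if key not in session:
        raise LookupError(f"session value {key!r} is missing; initialization did not run")
    return session[key]
```

If the browser never reaches the initialization page, the check reports which value is missing and why, instead of an unexplained NullReferenceException much later.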

HTTP Status code definition

https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

With the above change, the problem was resolved. Session loss is not an interesting case, but it is very typical:

1. The issue looks simple and straightforward, but it is hard to reproduce.

2. The root cause occurs much earlier than the visible symptom. We cannot get much useful information if we only look when the problem appears. We have to collect information proactively.

3. Fancy conditional breakpoints in WinDbg are cool, but logging is the last reliable friend for really tough cases. Deep understanding of the product and careful analysis of the problem make the logging effective.

Next I will discuss how the Chinese UI in SharePoint randomly turns to English.