How many of you are running into this problem & how many of you have been able to find out the cause for this? I am guessing quite a lot of them out there are running into this problem with SharePoint and very few may have found what the actual cause is. The truth is, this error can occur due to many different reasons, primarily due to performance. Recently, I had an opportunity to troubleshoot this problem for a production environment and what the cause turned out to be was quite interesting. So here’s my story and hopefully it will help some of you out there being careful even when making simple customization changes.
One of the first things to understand this behaviour is to understand what causes Request Timed Out Exception. If you turn off custom errors in SharePoint and set the CallStack attribute to true in the web.config file for the problem application, you will notice that the exception is of type System.Web.HttpWebException. This exception is thrown by the .NET CLR. The CLR provides a quantum of time to the managed thread that is executing the request. By default, this is 110 seconds in .NET CLR 2.0. If this threshold is reached, the request is aborted and this particular exception is thrown. Why do we do this? To prevent runaway threads. So this is actually a fail safe mechanism. However there are cases where you might need to adjust it to a higher value. Be careful when you do this because if you have lots of requests that are allowed to run for a long time, then you will end up exhausting the CLR threadpool causing more problems such as queuing of requests which appear as a hang. Unfortunately, there are a few Microsoft KB articles that recommend increasing this value to as high as 999999. My opinion is – Do not do this on production servers (atleast)
So now that we know what causes this exception, let me explain how I went about getting to the root cause. First – We scoping the problem area. This takes up 90% of the time. Since this is a performance problem, there are a few possibilities
- Possible performance issues with web server or SQL Server.
- Possible performance issue with a single site or site collection.
I am not including the network piece here – Why? Because this is a server side behaviour not network latency. The server is aborting the request because it has been running for too long.
The problem did not have specific symptoms. It could occur while browsing any site within the site collection or performing administrative tasks such as adding users, creating sites, adding items to a list. So basically any activity could trigger this behaviour.
Scoping Step 1: I created a new site collection in the problematic web application and checked what the performance is like. It was fast and we never experienced any timeout errors. So the problem was specific to a particular site collection.
Scoping Step 2: I looked about in the UI to see how different it is from the default SharePoint sites. I immediately realized that there were some customizations in place, mostly cosmetic. So that didn’t look like a problem to me.
Scoping Step 3: I needed some data while running into this problem to see what SQL queries and stored procedures are being called. Based on our possibilities listed above, I also wanted to see performance data from the web and SQL Servers. So I setup the SQL Server to capture a profiler trace with capturing data only from the PID of the IIS worker process hosting this web application. I also setup the web server to capture performance monitor log including all counters for the processor, process, thread, .NET CLR exceptions and Web Service counters. There is a reason for using these specific counters. These counters will provide me with the following information:
- The process that is taking up CPU cycles.
- The overall processor usage.
- The thread that is taking the most CPU cycles
- Exceptions are expensive. .NET CLR Exceptions counter will give me information if there are too many exceptions occurring in the IIS worker process hosting this application.
- Web Service Counters will give me information about the load on the web server
Once I got this information, I quickly realized that with a single request, we had a processor usage in excess of 60% which was sustained and lasted well over 100 seconds. I then looked at the SQL Profiler traces. When I look at the SQL profiler traces, I am usually trying to figure out
- If any particular query is taking up too much time to execute. (This can be figured out by using the Duration column)
- If there are queries that are invoked a lot of times
You can easily do this using the Organize Columns button in the Events Selection tab of the SQL profiler trace to group by either duration or ObjectName. Grouping by duration can quickly tell you the long running queries and grouping by ObjectName will give you the stored procedures invoked the most number of times. Here’s a screenshot
Both of the above scenarios could result in performance problems. In this case, I had data like the following from the SQL Profiler trace after grouping by ObjectName column.
|Object Name||Invocation Count|
Needless to say there is a pattern in which these stored procedures are being called. For example proc_SplitUrl invokes fn_GetLeafName and fn_GetDirName the same number of times. proc_SplitUrl is used by many stored procedures and let us see if we can match it up with something else over here that executes a large number of times... and we have proc_UrlToWebUrlOutput. This in turn maps to proc_GetTpWebMetaDataAndListMetaData. Thus the chain of execution is as follows:
proc_SplitUrl > proc_UrlToWebUrlOutput > proc_GetTpWebMetaDataAndListMetaData
If we lookup the documentation for these stored procedures on MSDN, you will notice that these stored procedures are invoked anytime SharePoint needs to enumerate the metadata for the site/list for whatever activity it is doing. The information we have so far isn’t enough to determine what the cause is. We know that some stored procedures are being invoked a very large number of times for a single request and the stored procedures being invoked are primarily to get metadata for the site/list. So this forms one part. The other unanswered part is – What user activity is causing us to enumerate the metadata? We already know that the problem occurs on any type of activity. So are all activities enumerating metadata?
Recollect that I also mentioned that my observations from perfmon collected from web servers shows sustained CPU utilization in excess of 60% for a single request. That is very bad. Imagine what could happen on a busy production server. I collected multiple hang dumps using Debugging tools for Windows to see what the thread(s) are doing? Maybe that has a direct relation on the SQL activity as well? Clearly, reviewing memory dumps require some advanced skill sets. There are many blogs out there that can get you started on managed debugging. Here’s a link to the most popular one. What I observed was quite interesting. All of the busy threads had a managed callstack where this portion was common:
From my domain knowledge of SharePoint and developer background, I immediately realized that all of this was being trigged by the use of a menu control. Check the highlighted portion above.
OK, so there are umpteen menus in the UI. Which one are we talking about? I poked to find this information from the memory dump. The Microsoft.SharePoint.WebControls.AspMenu on the managed stack has a property named _id that holds the name of the menu. When I dumped it out, the name of the menu was: TopNavigationMenu
So now, we are ready to put our story together backed up by data.
CONCLUSION: While building the user interface for any request, we need to build the top navigation menu since it is part of every page. While doing so, we end up enumerating a lot of metadata for many different sites. We looked into the number of top level sites and sub sites. There were few top level sites and many child & sub child sites. So why are we enumerating all of them? When I looked at the top menu, I found that just hovering over a menu item will throw open the list of sub sites and hovering over the sub menu will throw open the sub sites list for that site and so forth. In short, the nest level for showing the sub sites was more than 1 (which is the default). We figured that using SharePoint designer, a developer had changed the nest level to 4. Thus we end up enumerating the metadata for all of the sub sites – 4 levels deep, just so that we can use that data for security trimming.
We turned off the “Show subsites” option using the top level site settings: Portal > Site Settings > Modify Navigation
This immediately fixed the Request Timed out exception since we are no longer displaying the top level menu structure. So we proved our findings by making this change. The end resolution for this issue was to change the nest level of the top menu bar to the default of 1 or whatever value is appropriate for you without running into the Request Timed out exception.
Many of you may be wondering… such a trivial thing created such a huge performance problem? Well, yeah. Think about what impact any customization (even the smallest change) can have. The developer wouldn’t have noticed this in the beginning because the number of sites in the hierarchy may have been relative small. As the site structure grows, this quickly becomes a performance problem.
The other scenarios that I am aware of and by the way, a very common one that we see in our support cases, are large lists that even our customers were not aware of existed in their environment. My first suggestion is to click on the View All Site Content link in the quick Launch bar and see if you can find any large lists using the Items column. Keep in mind that there may still be large lists that are not visible here, such as hidden lists. How Microsoft engineers figure out large lists will probably be the next blog post. 🙂