In my line of business I normally do a fair amount of debugging but not as much as an Escalation Engineer would do. This was just one of those problems I was running into and since I really didn't have a clue on what would be the next steps, I resorted to gathering dumps (which didn't really help). But in this blog i am going to try and approach this problem from a troubleshooting perspective to understand the issue and maybe give you an insight on how PSS works.
The title would normally tell you where i am coming from, but here is a brief description.
Customer had written a brand new application in ASP.NET 2.0 and while running a stress test he observed that the Cache objects were getting trimmed almost immediately once they got created. Interestingly there really wasn't a "Memory Pressure"
A bit of clarification here. When I say that "Memory Pressure" wasn't high, what I really mean is that the total committed memory for the worker process was showing up as 100 MB ONLY. This does have to do something with memory pressure, but I will open that card when I come to the end of this blog 🙂
As usual when a customer calls in and says "I am running into such and such problem" and this is my environment, I get skeptical, not because I don't believe the customer, but just because I don't believe anything until I see it. There are too many cases I have dealt with where we were focused on a particular issue and were getting nowhere only to realize later that the information provided was insufficient or incorrect. And a lot of times, its not the customer to blame but someone else down the line who provided the wrong information.
So the first thing i told Punit (the SE who got this case) was to verify what the customer means when he says he is running into a cache problems getting trimmed, what does he really mean? In this case the customer really knew what he was talking about and to isolate the issue he created a simple application which added about 1000 items to the cache and he had implemented a callback method for these cache items. This callback method would then enumerate the items removed from the cache and display the reason for expiration.
On the customers environment we could see that the cache items were almost immediately trimmed in less than a minute and the reason showing up was "Underused". If we take a look at the documentation (http://msdn2.microsoft.com/en-gb/library/system.web.caching.cacheitemremovedreason(VS.80).aspx) this mentions that "Underused" is when "The item is removed from the cache because the system removed it to free memory"
This was a bit surprising to me because when we looked at the worker process in the Task Manager it was only showing up to be around 100 MB in size. We asked for the sample application and tried to reproduce this behavior and it worked perfectly in our environment. With some experimenting and playing around with the config file at the customers end, we found out that if we increase the value of percentagePhysicalMemoryUsedLimit to 90 or above it doesn't trim the cache items.
So the question now would be why is the Worker Process thinking that its under "Memory Pressure" and trimming the cache objects? First thing that came to my mind was that since this is only specific to the customers environment maybe he just doesn't have enough RAM on this machine. This theory fell to the ground when we found out that he has 2 GB of RAM.
OK, so what else? Dependencies? Maybe the cache items we are inserting have dependencies like File Dependencies or SQL Dependencies. Well from theory we know that the dependent item (cache object) will automatically be removed from the Cache if the dependency changes. So if I have a File Dependency set up and if something is modifying this file like for example a Virus Scan it would cause the Cache item to be removed. But this wasn't the case either, there were no dependencies associated with these items.
Application Restart's? This could be another possibility, Cache is global to the application domain itself, so if due to some circumstances like for example the web.config or global.asax is touched by a virus scan, it could cause a App Domain Recycle resulting in the cache getting trimmed. There are various other reasons for this to happen and its beyond the scope of this blog so I will keep it short for now.
So at this point, I was running out of ideas and I did the next logical thing we do here at PSS, take dumps and Perfmon data. The counters in perfmon that i was looking for were around cache, specifically under the "ASP.NET Application Counters". I was also interested in looking at the counters under .NET Memory to see if I can find anything in there.
When i looked at the perfmon data, it just proved what we already knew. The counters i was looking at were, the Cache Total Entries, Cache Total Turnover Rate, Cache API Entries, Cache API Turnover Rate. Interestingly the turnover was large and the total entries was going up and then suddenly falling down.
Next thing I did was to look at the memory dumps (we had taken three consecutive dumps and when i looked at the Private Bytes Pressure this is how it looked.
0:000> !do 05597268
Size: 52(0x34) bytes
GC Generation: 2
MT Field Offset Type VT Attr Value Name
790fed1c 4001345 8 System.Int32 0 instance 81_pressureHigh
790fed1c 4001346 c System.Int32 0 instance 80 _pressureMiddle
790fed1c 4001347 10 System.Int32 0 instance 79 _pressureLow
790fed1c 4001348 14 System.Int32 0 instance 0 _i0
791240f0 4001349 4 System.Int32 0 instance 0559729c _pressureHist
790fed1c 400134a 18 System.Int32 0 instance 12 _pressureTotal
790fed1c 400134b 1c System.Int32 0 instance 2 _pressureAvg
790fcb80 4001350 20 System.Int64 0 instance 1286602752 _memoryLimit
79105ef0 4001358 28 System.DateTime 1 instance 05597290 _startupTime
79104f64 4001351 808 System.Boolean 0 shared static s_isIIS6
>> Domain:Value 0016cc10:NotInit 0019ef10:0 <<
790fcb80 4001352 7f0 System.Int64 0 shared static s_autoPrivateBytesLimit
>> Domain:Value 0016cc10:NotInit 0019ef10:-1 <<
790fcb80 4001353 7f8 System.Int64 0 shared static s_totalPhysical
>> Domain:Value 0016cc10:NotInit 0019ef10:0 <<
790fcb80 4001354 800 System.Int64 0 shared static s_lastReadPrivateBytes
>> Domain:Value 0016cc10:NotInit 0019ef10:26214400 <<
79105ef0 4001355 1c8 System.DateTime 1 shared static s_lastTimeReadPrivateBytes
>> Domain:Value 0016cc10:NotInit 0019ef10:055af9e8 <<
79105474 4001356 80c System.UInt32 0 shared static s_pid
>> Domain:Value 0016cc10:NotInit 0019ef10:6104 <<
790fed1c 4001357 810 System.Int32 0 shared static s_pollInterval
>> Domain:Value 0016cc10:NotInit 0019ef10:120000 <<
So surprisingly from the above results, it looks like the the memory limit is 1.2 GB. I did a lot more digging around in the dump to make some sense of the results I was seeing, but didnt make headway, so I took the next step at this point and engaged the Product Team by raising an internal bug. I went back and forth with Thomas from the product team and finally found out the problem was related to the "Available Memory" being low. Now Cache objects are trimmed based on the algorithm
Pressure = (Total Memory - Available Memory)/Total Memory
So in the customers case the total memory was 2 GB, Available was 350 MB and this expression evaluates to
(2048-350)/2048 = 0.82 which means 82%.
The customer had set the percentagePhysicalMemoryUsedLimit to 80% and thats ASP.NET was trimming the cache items. As a rule when we go over the _preassureHigh value we drop half the cache entries and if we go over _preassureMiddle we drop a small percentage.
ASP.NET 2.0 attempts to flush roughly 50% of the memory consumed by cached entries (both output cached and objects cached using the Cache object) when memory consumption for a process approached 90% of the memory consumption limits. On a 2GB 32-bit machine, the default memory limit is 800MB and on a /3GB 32-bit machine (or WOW64) the limit is 1.8 GB.