One of the common scenarios in the SAP space today is to perform a backup to local disks and then have an agent of a centralized backup infrastructure pick up the backup files and copy those to the backup infrastructure. No problem with that. However having full database backups on the productive database server also triggers other scenarios where the backup files should be copied manually to other destinations. E.g. building up a sandbox system in production size or synchronizing a test system or development system with the production volume database. Dealing with databases in a few TB size one usually wants to get through with such a copy pretty fast. Hence a huge advantage is to use database backup compression as offered with SQL Server 2008 or offered by 3rd party. Another advantage is to have multiple backup files so that one could start these copies in parallel in order to have the copying taking as little time as possible. Another side condition might be that the server the copy is going towards is not close by, but might sit in the DR site. Hence the copy goes through a rather slow line. This exactly was the scenario which we had to copy our compressed backup files (1.2TB out of 6TB database volume of our SAP ERP database) to a server in the DR site. We knew that the link to the DR site is not all that great. Hence we tried to reduce the time by starting 4 or 6 RoboCopy jobs in parallel.
However we within 3 weeks encountered two situations where our SAP ERP system more or less seems to be stuck. Investigating immediately we realized that we even couldn’t login into the database server anymore. We still could ping the server, we also could get Terminal Service Client connected. But after typing in the password, nothing happened than looking at an hour glass. At the end we had to take the server down hard. The second occurrence we were luckier. One of our Basis persons had a TS session alive to the server and SQL Server Management Studio connected when our ERP system showed the same symptoms again. This time we could connect to SQL Server as well in order to analyze. This time SQL Server also wrote minidumps about non-yielding schedulers. The most concerning data was found right at the top of the mini-dumps. This was the list of memory:
MemoryLoad = 99%
Total Physical = 73724 MB
Available Physical = 3 MB
Total Page File = 104552 MB
Available Page File = 55236 MB
Total Virtual = 8388607 MB
Available Virtual = 8312979 MB
It wasn’t too hard to account the problem of a non-yielding scheduler to the fact that there hardly was any physical memory left. SQL Server as itself was configured with ‘lock pages in memory’. But having hardly any physical memory left certainly could trigger all kinds of issues given that ‘lock pages in memory’ only applies to SQL Buffer Pool (including associated cache areas which are taken out of the Buffer Pool) and not to what we call mem-to-leave (direct memory allocations of SQL Server). Also checking the new DMV sys.dm_os_ring_buffers told a story of SQL Server trying to reduce its memory foot print rapidly with smaller and smaller physical memory being available. Windows not being able to authenticate new users also explained why SQL Server wouldn’t accept new connections anymore. Based on what we observed, it became clear that the memory issue is not a result of SQL Server, but SQL Server also fell victim of the memory shortage. Since nothing else than SQL Server and the RoboCopy jobs were running, we killed the RoboCopy jobs. An immediate relief was observed. People could log into the server again. SQL Server expanded memory again and SQL Server writing mini-dumps stopped and our SAP ERP system became responsive again.
Sure this triggered investigations. Our investigations basically ended when we found this KBA article:
http://support.microsoft.com/kb/920739/ ‘You may experience a decrease in overall system performance when you are copying files that are larger than approximately 500 MB in Windows Server 2003 SP1 or in Windows Server 2003 SP’
What was described in this article basically matched our scenario. Sure our single files were not that large (certainly larger than 500MB with around 200GB each), but we also tried to copy 4 or 6 of them over a slow line (looking like slow disks to the OS). At the end the Cache Manager couldn’t flush the pages fast enough to the ‘slow disk’. Since the limitation of having dirty pages in the cache manager is set to half of the memory which in the case of our server is 32GB, the cache manager expanded and expanded and ate up all the memory until the system was as good as starved from memory. Solution for us was to set the SystemCacheDirtyPageThreshold to 3GB via Registry key as described in the KBA. Since this setting is alive, we didn’t have any problems copying larger backup files without bottlenecking the server on memory.
Thanks Amit Ganguli from Microsoft IT for doing extensive tests and investigations around this issue and giving input to this article.