Ninja Troubleshooting: User Interface freezing during SQL Server database restore

Hello world :)

To sum up, I was working with one of my customers on a UI (User Interface) freeze that occurred while restoring a database into SQL Server 2012 with Symantec NetBackup, which left the server unresponsive.

Scenario (VMware guest – WS2012R2):

I started by gathering performance counters to jump into the analysis. Even though there wasn’t clear CPU pressure (all green above :) ), I found spikes on a specific CPU (11) that matched the UI freeze, as observed below:

Basically, the Processor % Privileged Time counter indicates the percentage of elapsed time that a processor spends executing in privileged mode, also known as kernel mode. When your application calls operating system functions (for example to perform file or network I/O or to allocate memory), these operating system functions are executed in privileged mode. The counter also includes the time spent servicing interrupts and DPCs (Deferred Procedure Calls). I recommend using the Performance Analysis of Logs (PAL) Tool to assist with performance counter analysis.
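To make the counter analysis concrete, here is a minimal Python sketch of the kind of check that exposes this pattern: one CPU peaking in % Privileged Time while the machine-wide average still looks healthy. The sample values, the 80% threshold and the function names are assumptions for illustration only; they are not the customer's data and not part of PAL.

```python
from statistics import mean

# Hypothetical per-CPU "% Privileged Time" samples (one value per sampling
# interval). The numbers are made up to mimic the pattern described above.
SAMPLES = {
    0: [3.0, 4.0, 5.0, 4.0],
    1: [2.0, 3.0, 2.0, 3.0],
    11: [5.0, 97.0, 95.0, 6.0],
}


def overall_average(samples):
    """Average of all samples across all CPUs (the 'all green' view)."""
    return mean(value for values in samples.values() for value in values)


def find_hot_cpus(samples, threshold=80.0):
    """Return {cpu: peak} for CPUs whose peak sample exceeds the threshold."""
    return {
        cpu: max(values)
        for cpu, values in samples.items()
        if max(values) > threshold
    }


if __name__ == "__main__":
    print("Overall average:", round(overall_average(SAMPLES), 1))
    print("Hot CPUs:", find_hot_cpus(SAMPLES))
```

The point of the sketch is that an averaged view can hide a single saturated core, so per-processor instances of the counter are worth collecting.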

PFE (Premier Field Engineering) has a culture of doing Root Cause Analysis… So, let’s dig a little deeper :) using WPR (Windows Performance Recorder), which is included in the Windows Assessment and Deployment Kit, and record a trace while the issue is being reproduced…

Analyze the trace with Windows Performance Analyzer:

First trace, without any change on the Operating System:
Since we want to focus on % Privileged Time, we analyze the DPC duration by module and processor:

 

Hmmm… Interesting… NDIS.SYS is taking most of the compute time, more precisely the function ndisInterruptDpc, which led me to RSS (Receive Side Scaling), a feature that improves network data handling on multiprocessor systems.

Since this is a virtualized multiprocessor system, it was time to check whether RSS was enabled on the NIC used for backup (https://technet.microsoft.com/en-us/library/dn383582.aspx). It wasn’t enabled.
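One way to script that check is sketched below in Python, wrapping the `Get-NetAdapterRss` PowerShell cmdlet. This is not necessarily how the check was done in this case: the adapter name "Backup" and the helper names are hypothetical, and the sketch assumes a Windows host where the NetAdapter cmdlets are available.

```python
import json
import subprocess


def rss_query_command(adapter_name):
    """Build a PowerShell command line that reports RSS state as JSON."""
    safe_name = adapter_name.replace("'", "''")  # escape single quotes
    script = (
        f"Get-NetAdapterRss -Name '{safe_name}' | "
        "Select-Object Name, Enabled | ConvertTo-Json"
    )
    return ["powershell.exe", "-NoProfile", "-Command", script]


def parse_rss_state(json_text):
    """Return {adapter name: enabled?} from the JSON produced above."""
    data = json.loads(json_text)
    if isinstance(data, dict):  # a single adapter is emitted as one object
        data = [data]
    return {item["Name"]: bool(item["Enabled"]) for item in data}


def is_rss_enabled(adapter_name):
    """Run the query on a Windows host; True if RSS is enabled on the NIC."""
    result = subprocess.run(
        rss_query_command(adapter_name),
        capture_output=True, text=True, check=True,
    )
    return parse_rss_state(result.stdout)[adapter_name]


if __name__ == "__main__":
    # "Backup" is a hypothetical adapter name; replace it with the real one.
    print(is_rss_enabled("Backup"))
```

If the check returns False, `Enable-NetAdapterRss -Name 'Backup'` from an elevated PowerShell prompt is the matching cmdlet for turning RSS on; the virtual NIC driver also has to support it.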

At this point, it was clear that this virtual core (CPU 11) wasn’t able to cope with all the network-related workload.

Next step: enable RSS on the NIC used for backup, reproduce the workload, trace it and analyze it again in WPA. The result can be clearly observed in the data below:

 

Conclusion: when RSS is enabled on the backup NIC, the network stack workload is scaled out across 4 CPUs, reducing the “spike workload” from +/- 5 seconds on a single CPU to +/- 1 second spread across 4 CPUs, and the UI no longer freezes.

Interestingly, VMware also has an article related to this situation:
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2008925

“Receive Side Scaling (RSS): RSS is a mechanism which allows the network driver to spread incoming TCP traffic across multiple CPUs, resulting in increased multi-core efficiency and processor cache utilization. If the driver or the operating system is not capable of using RSS, or if RSS is disabled, all incoming network traffic is handled by only one CPU. In this situation, a single CPU can be the bottleneck for the network while other CPUs might remain idle.”
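The idea in that quote can be illustrated with a small Python sketch that maps each TCP flow to a CPU by hashing its address/port tuple. This is a simplification under stated assumptions: real RSS uses a Toeplitz hash and an indirection table, while the sketch swaps in CRC32; the IP addresses, ports and CPU numbers are hypothetical.

```python
import zlib
from collections import Counter


def pick_cpu(flow, rss_cpus):
    """Map a (src_ip, src_port, dst_ip, dst_port) flow to one of rss_cpus.

    Stand-in hash: CRC32 of the tuple. Real RSS uses a Toeplitz hash plus an
    indirection table; only the "same flow -> same CPU" idea is kept here.
    """
    key = "|".join(str(part) for part in flow).encode("ascii")
    return rss_cpus[zlib.crc32(key) % len(rss_cpus)]


def distribute(flows, rss_cpus):
    """Count how many flows land on each CPU."""
    return Counter(pick_cpu(flow, rss_cpus) for flow in flows)


# Hypothetical backup traffic: 64 TCP connections from one media server.
FLOWS = [("10.0.0.5", 50000 + i, "10.0.0.9", 1556) for i in range(64)]

if __name__ == "__main__":
    # Without RSS, every flow is handled by the same CPU.
    print("RSS off:", dict(distribute(FLOWS, [11])))
    # With RSS, flows are assigned across the configured CPUs.
    print("RSS on :", dict(distribute(FLOWS, [8, 9, 10, 11])))
```

Because the hash is computed per flow, packets of one connection stay on one CPU, which preserves ordering and cache locality while different connections can be processed in parallel.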

Tools used:
Performance Counters
Performance Analysis of Logs
Windows Performance Recorder
Windows Performance Analyzer

Regards,
Paulo Condeça