Ironical Case of a Performance Monitoring Tool causing a Performance Issue!

Hello folks, this is Pushkar and I recently worked an interesting case dealing with high CPU usage. The case was particularly interesting but it was not a business critical application consuming the CPU. I was tasked to identify why a third party Performance Monitoring tool was causing a performance issue on the server by consuming 100% of the CPU cycles. The irony of the issue itself made this particular case immensely interesting for me and I decided to investigate it further. Typically issues with third party products are addressed by the Application Vendor.

In this case the monitoring tool, Monitor.exe (disclaimer: Identity of the actual product has been removed), was consistently consuming 100% of the CPU cycles, and as expected if the tool was stopped, the CPU usage was back to normal. As I didn’t have information about how the tool worked, I decided to gather data to give me an idea about the underlying API's the tool was calling. This would give me an understanding of its behavior.

To begin with the troubleshooting I gathered Kernrate.exe logs from the server along with a memory dump. On platforms which are Windows Vista and upwards you can use the Windows Performance Toolkit (aka. Xperf), a better alternative to Kernrate.

Note: To learn more about kernrate.exe check here

A quick review of the Kernrate Log showed a lot of CPU time in the Kernel, and the function mostly called was NtQueryDirectoryObject()

TOTAL K 0:07:35.343 (49.2%) U 0:02:40.734 (17.4%) I 0:05:09.171 (33.4%) DPC 0:00:52.984 (5.7%) Interrupt 0:00:20.312 (2.2%)

       Total Interrupts= 4208713, Total Interrupt Rate= 9098/sec.

-----------------------------

Results for Kernel Mode:

-----------------------------

OutputResults: KernelModuleCount = 106

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime 116300 hits, 65536 events per hit --------

 Module Hits msec %Total Events/Sec

PROCESSR 45160 462654 38 % 6397017

NTOSKRNL 43573 462654 37 % 6172215

SYMMPI 18258 462654 15 % 2586287

----- Zoomed module NTOSKRNL.EXE (Bucket size = 16 bytes, Rounding Down) --------

Percentage in the following table is based on the Total Hits for this Zoom Module

ProfileTime 43573 hits, 65536 events per hit --------

 Module Hits msec %Total Events/Sec

NtQueryDirectoryObject 8433 462654 19 % 1194553

memmove 6584 462654 14 % 932638

KeInsertQueueDpc 5593 462654 12 % 792261

KeFlushMultipleTb 3118 462654 7 % 441671

 

Further Investigation of a dump gathered indicated the function listed above was invoked to parse through the \BaseNamedObjects namespace in the kernel. BaseNamedObjects is a folder in the Kernel Object Namespace where various kernel objects: events, semaphores, mutexes, waitable timers, file-mapping objects, and job objects are created. The purpose of such global namespace is to enable processes on multiple client sessions to communicate with a service application. Another use of such namespace is for applications that use named objects to detect that there is already an instance of the application running in the system across all sessions.

I started looking under the \BaseNamedObjects namespace in the kernel and found there were numerous, over 2900 objects under that Namespace. You can use SysInternal tool WinObj.exe to browse through the Kernel Object Namespace and the list of objects within those namespaces. In my case I had no prior idea of the cause of the issue so I had to look at the dump file. I pasted a snapshot of the WinObj.exe ouput below along with a stack from the kernel dump.

WinObj Screenshot

kd> !object \BaseNamedObjects

Object: fffffa8000d72ec0 Type: (fffffadfe7acb1e0) Directory

    ObjectHeader: fffffa8000d72e90 (old version)

    HandleCount: 36 PointerCount: 10275 <- This indicates the Number of Object under this namespace

    Directory Object: fffffa8000004060 Name: BaseNamedObjects

    Hash Address Type Name

    ---- ------- ---- ----

     00 fffffadfe5d79660 Job TestJobObj_9920

         fffffadfe5d7a8e0 Job TestJobObj_9913

         fffffadfe5d7b8e0 Job TestJobObj_9907

         fffffadfe5d84060 Job TestJobObj_9850

         fffffadfe5d863e0 Job TestJobObj_9843

         fffffadfe5d873e0 Job TestJobObj_9837

         fffffadfe5d8e060 Job TestJobObj_9790

         fffffadfe5d903e0 Job TestJobObj_9783

         fffffadfe5d913e0 Job TestJobObj_9777

         fffffadfe5dad660 Job TestJobObj_9611

         fffffadfe5dae660 Job TestJobObj_9605

        fffffadfe5db7660 Job TestJobObj_9551

         fffffadfe5db8660 Job TestJobObj_9545

         fffffadfe5db98e0 Job TestJobObj_9538

<Snipped>

In my case the third party performance monitoring tool was experiencing high CPU when they queryied “Job Objects” and “Job Object Details” performance counters.

Since this was happening with a third party tool, I tried to recreate the issue in-house, and wrote a small utility to create a lot of Named Job Objects under \BaseNamedObjects namespace using CreateJobObject() API. Then I tried to use the plain old Perfmon tool built into Windows. The moment I tried to add the “Job Objects” & “Job Object Details” counters, the CPU utilization from MMC.EXE hit 90% (Perfmon runs as an MMC). Now I could reproduce the issue in-house so I investigated what happened when we try to query those counters.

Here is what I found:

When we attempt to query the Job Object or Job Object Details performance counter it invokes the following functions respectively in Perfproc.dll which would invoke NtQueryDirectoryObject() to locate the Object under the \BaseNamedObjects Namespace. Then it calls QueryInformationJobObject to gather the performance data related to that objects. One of the thread stacks is given below to show how the flow goes through

 

CollectJobDetailData()

CollectJobObjectData()

 

kd> kb

RetAddr : Args to Child : Call Site

fffff800`01040abd : 00000000`000ad490 fffff800`01385b47 00000000`00002990 00000000`00000000 : nt!NtQueryInformationJobObject

00000000`77ef171a : 00000000`77d423cb 00000000`000ad498 00000000`00000000 00000000`c0000135 : nt!KiSystemServiceCopyEnd+0x3

00000000`77d423cb : 00000000`000ad498 00000000`00000000 00000000`c0000135 00000000`00000000 : ntdll!ZwQueryInformationJobObject+0xa

000007ff`5a794a16 : 00000000`000ad990 00000000`00000008 00000000`00000400 00000000`000addf8 : kernel32!QueryInformationJobObject+0x77

000007ff`5a7932b9 : 00000000`0000dd20 000007ff`5a79aa84 00000000`00000001 00000000`00000040 : perfproc!CollectJobObjectData+0x356

000007ff`7feeb497 : 00000000`00000000 00000000`000adee0 00000000`00000000 00000000`00000001 : perfproc!CollectSysProcessObjectData+0x1f9

000007ff`7fef09d1 : 00000000`000ae1d0 00000000`00000000 00000000`0000007a 00000000`00000000 : ADVAPI32!QueryExtensibleData+0x951

000007ff`7fef0655 : 0039002d`00350038 00000000`77c43cbd 00000000`00000000 00000000`00000000 : ADVAPI32!PerfRegQueryValue+0x66d

000007ff`7ff0b787 : 00000000`00000000 00000000`00000000 00000000`000ae8a0 00000000`000ae888 : ADVAPI32!LocalBaseRegQueryValue+0x356

000007ff`5b17ba27 : 00009ae8`b8ee62a9 00000000`000c46a0 00000000`00200000 00000000`00000000 : ADVAPI32!RegQueryValueExW+0xe9

 

You can determine the type of performance data queried for each Job Object in Win32_PerfFormattedData_PerfProc_JobObject Class. Although it documents a WMI interface for querying the Performance Data, it represents the same data set the native API also queries.

Once I understood what was happening under the hood, it was easy to deduce that every iteration of the query will be a CPU intensive operation. For each object we are going to do a recursive query for the Performance Data as mentioned in the above document. If the number of items is large, and we are running a broader query of the nature “All Counters” & “All Instances” then every single iteration would consume a lot of CPU. To validate this, I got help from one of my colleagues, Steve Heller, who modified the sample code received from the vendor of the Monitoring Tool (Which demonstrated what they were doing). We noticed they were querying this information every 2.5 seconds. The sample tool demonstrated that with about 10,000 Job Objects, one single iteration of the query was taking roughly 12.5 seconds to complete. No wonder the CPU usage would remain high, because before we finish a single iteration of the query for all the Job Objects, four additional queries were already queued and the CPU Usage continued to remain at 90% or more.

The conclusion that can be drawn from this test is that querying \BaseNamedObjects namespace with a large number of objects under it will invariably result in a fair amount of CPU usage. This could get worse if the query is performed for complex objects e.g., Job Objects for which we would run nested queries for individual Performance Data for each Job Object, and if the data is queried too frequently.

The two aspects responsible for the high CPU usage are:

1. The frequency at which the Performance Data was being queried

2. The number of objects under \BaseNamedObjects namespace.

 

Though there is no defined limitation on the number of objects that can be created under various Kernel Object Namespaces, absence of such documents doesn’t mean that we should simply create a lot of objects. Applications should be designed carefully to use such kernel resources judiciously.

At the same time because we can query a lot of performance data as the interfaces are publicly available, doesn’t mean we should query them at a very high frequency. Certain operations can be CPU intensive as we have seen here and we should identify what we really want to query and at what frequency.

Quoting for the MSDN article Job Objects

  “A job object allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. Operations performed on the job object affect all processes associated with the job object.”  

The above point clearly indicates that the Job Object Framework is intended for processes which are participating in a Job to monitor and control each other’s resource usage.

There could be different reason for which a system administrator needs to gather information about the resource usage by a particular Process. One of them could be to meet Service Level Agreement. In that case we can monitor individual processes running on the server with the counters available under “Performance Object -> Process”.

Equipped with the knowledge that querying Performance Data for Job Object may result in excessive CPU usage on the server, an Administrator should evaluate or consult the application owner (If needed) to understand whether they really want to gather this performance data.

The following documents are available for application developers to assist with the creation of application specific namespaces. Creation of a private namespace and creating objects under that namespace would ensure that application specific objects are not exposed to the Performance Library because they won’t appear under \BaseNamedObjects namespace.

Object Namespaces

https://msdn.microsoft.com/en-us/library/ms684295(VS.85).aspx  

 

CreatePrivateNamespace Function

https://msdn.microsoft.com/en-us/library/ms682419(VS.85).aspx

 

Job Objects

https://msdn.microsoft.com/en-us/library/ms684161(VS.85).aspx

CreateJobObjects Function

https://msdn.microsoft.com/en-us/library/ms682409(VS.85).aspx

 

This problem would be predominantly noticed on older systems. To mitigate it to a certain extent, Windows Vista (and all post operating systems) have a per-session BaseNamedObjects namespace (i.e., it application-specific objects will be created under namespace in the following form \Sessions\<SessionID>\BaseNamedObjects).Hence these objects will not be exposed to the Performance Libraries and will not be available to query via Performance Monitoring Tools. Unfortunately the changes which were done to the kernel to make this work are too intricate to be back-ported to Windows 2003 SP2.

This brings me to the end of this post, I hope this information gives you better insight on the how oversight in an applications design can lead to the breaking of some other components at a later stage.