My favorite perfcounters

Article
06/06/2017

… are the NDIS counters. Like you even had to ask.

We love performance counters on the Windows Networking team. We routinely use performance counters to monitor & diagnose issues. Nearly every networking feature has its own counter set, and Windows ships with far too many counter sets to document here. Fortunately, I don’t have to document them all – this is the NDIS blog, so I’ll lean on my NDIS bias.

These are the Windows Networking counter sets of particular interest to NDIS developers:

Network Adapter and Network Interface
Per Processor Network Interface Card Activity and Per Processor Network Activity Cycles
Physical Network Interface Card Activity
RDMA Activity
Processor Information

Let’s chat a bit about each of these.

The Network Adapter and Network Interface counter sets.

These are very similar. In fact, the only difference is that Network Adapter shows more types of NICs than Network Interface does. Since Network Adapter has more stuff, I generally prefer it over Network Interface. The only reason to use Network Interface is if you need to write code that works on very old OS versions – Network Adapter was “only” added back in 2006.

The data in these counter sets is pulled together from a few sources:

Many counters are populated from GetIfTable2, which itself ultimately boils down to querying OID_GEN_STATISTICS from the NIC driver.
RSC counters: These are calculated internally by TCPIP.
Current Bandwidth: This is the arithmetic mean of the NIC driver’s NDIS_LINK_STATE XmitLinkSpeed and RcvLinkSpeed.
Output Queue Length: This is not currently implemented, and should be ignored.

The Per Processor Network Interface Card Activity and Per Processor Network Activity Cycles counter sets.

What a mouthful. These two counter sets are more advanced than Network Adapter, and you can use these to dig deeper into a performance issue. Your frontline troops should be your Network Adapter counters. But if that’s not enough, it’s time to call in the Per Processor NDIS counters.

As their names suggest, these counters track data per processor. That’s super helpful when diagnosing RSS or interrupt delivery problems. The Activity counter tracks how many times a particular event occurs, like how many packets are received. The Cycles counter tracks how many CPU cycles are spent on a particular task, like how many CPU cycles are spent in a call to the NIC driver’s interrupt handler. These two counter sets work great in tandem – you can typically divide the Cycles data by the Activity data to get cycles-per-operation.

One caveat about interrupts and DPCs: Not all hardware has interrupts (in the traditional sense, at least). Typically only a PCI bus based NIC will use interrupts or DPCs. For all other types of NICs, these counters will report zero. Furthermore, NDIS drivers can choose to either use NDIS APIs (like NdisMRegisterInterruptEx) or to use WDM APIs to handle interrupts and DPCs. If the driver doesn’t use NDIS APIs, then NDIS won’t be able to track the interrupts and DPCs, and these counters will be zero. So these counters may not report values for some vendors’ drivers.

One caveat about the cycle counters: The cycle counters are measured in terms of the nominal clock tick of your processor. On x86 and x64, this is the units of the rdtsc instruction. On arm32, this is the units of the pmccntr register. Cycle counters are not directly comparable between machines with different processors. If your processor throttles itself down to a lower effective clock in order to save power (e.g, Intel’s P-states), then the cycle counter might not be meaningful. You should only make comparisons between cycle counters under carefully-controlled lab conditions.

One final caveat: These counters were added in Windows 7, but their implementation was changed a bit in Windows 8. Although you’ll find counters of the same name on Windows 7, the definitions below apply only to Windows 8 or Windows Server 2012 and later. I do not suggest you use these counters on Windows 7, unless you’ve first carefully verified that the counter seems to give reliable results.

Because these counters are so low-level, it’s worth spending a bit of time explaining what each of them means. Here’s the rundown.

Interrupts/sec: The number of times the MiniportMessageInterrupt or MiniportInterrupt handler was invoked and returned TRUE to indicate that the interrupt was recognized.

Interrupt Cycles/sec: The number of CPU cycles spent in MiniportMessageInterrupt or MiniportInterrupt.

Interrupt DPC Latency Cycles/sec: The number of CPU cycles between the end of an ISR and the start of a DPC. If this value is unexpectedly high, it may indicate that other DPCs or ISRs are starving the NIC’s DPC. Use WPA to see a timeline of all ISRs and DPCs on the CPU.

DPCs Queued/sec: The number of times that a DPC was executed. (The name is a tiny bit misleading; it’s not the number of times a DPC was queued.)

DPCs Queued on Other CPUs/sec: The number of processors that are targeted by NdisMQueueDpcEx, excluding the current processor. For example, suppose a NIC driver is running on processor 5 and calls NdisMQueueDpcEx with a mask that targets processors 4, 5, and 7. Then this counter will increment by 2, for processors 4 and 7. This will often be zero. If it is a very high number, it may indicate that RSS is running in a less-than-optimal mode. Ideally, a NIC that supports RSS will interrupt the correct processor directly, without needing to schedule a cross-processor DPC.

DPCs Deferred/sec: The number of times that NDIS decided to execute a DPC at passive level, due to Receive Side Throttling (RST). Generally you should expect this to be at zero for Server-class machines.

Interrupt DPC Cycles/sec: The number of CPU cycles spent in MiniportMessageInterruptDpc or MiniportInterruptDpc. Typically this includes both send-complete and receive-indicate activity.

Received Packets/sec: Watch out: this one is subtle! It’s the total number of packets that the NDIS indicated to each protocol driver. For example, if the NIC indicates up an IPv4 frame, but there is no IPv4 protocol driver bound to the NIC, then that frame is not counted; the counter will be zero. More interestingly, if there are two protocols bound to the NIC that both want IPv4 frames (e.g., the TCPIP driver and Wireshark’s NPF driver), then each packet gets counted twice. The gotcha is that if you launch Wireshark, this counter will double. Not because you’re receiving more traffic at the physical layer, but because NDIS is indicating the same traffic to twice as many protocol drivers.

Receive Indications/sec: The total number of times that NDIS indicated packets to a protocol driver. You can divide Received Packets/sec by Receive Indications/sec to get the average batch size. Low-performance client NICs might not batch packets at all, in which case the average batch size will be 1.0. Increasing the Interrupt Moderation setting should increase the batch size.

Low Resource Received Packets/sec: This is the total number of packets indicated to a protocol driver while the NDIS_RECEIVE_FLAGS_RESOURCES flag is set. Some NIC drivers never use this flag, and a few NIC drivers will use this flag always. In some cases, NDIS always sets the flag internally. Despite the name, the flag doesn’t necessarily mean that the NIC is low on resources, although some drivers use it to signal that. In general, you’ll need to know exactly how the NIC driver and NDIS use the flag to interpret this counter. Note that the OS may spend more cycles processing a received packet if the NDIS_RECEIVE_FLAG_RESOURCES flag is set, so you should generally prefer that this counter is at or near zero. If your servers typically have this flag at zero, but suddenly it spikes to a high value, that is a red flag that some network driver is spending too long processing received packets, or has leaked received packets.

Low Resource Receive Indications/sec: The number of times that NDIS indicates packets to a protocol driver with the NDIS_RECEIVE_FLAGS_RESOURCES flag.

Stack Receive Indication Cycles/sec: The number of CPU cycles spent in any ProtocolIndicateReceiveNetBufferLists handler.

NDIS Receive Indication Cycles/sec: The number of CPU cycles spent in NdisMIndicateReceiveNetBufferLists. Typically this would include each filter driver, as well as any protocol drivers. However, if a filter driver defers a receive indication, the time spent in in protocol drivers will not be included.

Returned Packets/sec: The number of receive packets that were processed by the OS, and returned to the NIC driver stack. Specifically, the number of packets that were passed to a call to NdisReturnNetBufferLists. This should almost exactly correlate with Received Packets/sec minus Low Resource Received Packets/sec.

Return Packet Calls/sec: The number of times a protocol driver called NdisReturnNetBufferLists.

NDIS Return Packet Cycles/sec: The number of CPU cycles spent in NdisReturnNetBufferLists. Typically this includes all time spent in filter drivers, the NIC driver, and NDIS itself. However, if a filter driver defers the return operation, or the network adapter is in a low power state due to NDIS Selective Suspend, then this counter will not include all CPU cycles.

Miniport Return Packet Cycles/sec: Despite the name, this counter is implemented to be largely similar to NDIS Return Packet Cycles/sec. This counter includes filters drivers, the NIC driver, and NDIS itself; and the same caveats apply.

Sent Packets/sec: The number of packets that were sent via NdisSendNetBufferLists. Note that, in rare cases, an NDIS lightweight filter driver may itself drop or originate new send packets. Those will not be accounted for in this counter; this counter measures the top of the filter stack.

Send Request Calls/sec: The total number of calls to NdisSendNetBufferLists.

Miniport Send Cycles/sec: The total number of CPU cycles spent in MiniportSendNetBufferLists. This includes much of the time sending packets, but will not account for CPU cycles spent in any workitem, timer, interrupt, or DMA callback. Therefore, (like any cycle counter) it is not comparable across different NIC drivers.

NDIS Send Cycles/sec: The total number of CPU cycles spent in NdisSendNetBufferLists. Typically, this includes all time spent in filters, the miniport driver, and NDIS itself. However, if any filter defers a packet and reinjects it later, then those CPU cycles won’t be included here. For example, the built-in QoS Pacer filter defers some NBLs and reinjects them from a timer.

Sent Complete Packets/sec: The total number of packets that were returned to a protocol driver, via ProtocolSendNetBufferListsComplete.

Send Complete Calls/sec: The total number of calls to any ProtocolSendNetBufferListsComplete handler.

Stack Send Complete Cycles/sec: The total number of CPU cycles spent in ProtocolSendNetBufferListsComplete handlers.

NDIS Send Complete Cycles/sec: The total number of CPU cycles spent in NdisMSendNetBufferListsComplete. This may include time spent in filter drivers and protocols, if the filter drivers handle send completion synchronously. In that case, this counter’s value will be slightly larger than the Stack Send Complete Cycles/sec counter’s value, and their difference tells you how many cycles were spent in the filter stack and NDIS itself. However, if a filter driver defers send completion, this counter will not include time spent in protocols.

Build Scatter Gather List Calls/sec: The number of times the NIC driver calls NdisMAllocateNetBufferSGList. Typically this would be comparable to the number of sent packets per second.

Build Scatter Gather Cycles/sec: The number of cycles spent inside NdisMAllocateNetBufferSGList. Note that this may also include the time spent in MiniportProcessSGList, if the map registers are immediately available and the MiniportProcessSGList handler is called inline. However, if MiniportProcessSGList is deferred and called later, this counter does not then include cycles spent waiting or the cycles spent in MiniportProcessSGList. This counter is difficult to interpret, and I generally do not suggest you use it.

RSS Indirection Table Change Calls/sec: The number of times that a protocol driver updates the RSS indirection table on a NIC. Specifically, the number of times that OID_GEN_RECEIVE_SCALE_PARAMETERS is sent and the NDIS_RSS_PARAM_FLAG_ITABLE_UNCHANGED flag is clear. This should typically be a low number – RSS table updates are pure overhead, and ideally they don’t happen often. If you have many short-lived TCP connections and high CPU usage, the OS may need to update the table more often.

Miniport RSS Indirection Table Change Cycles: The total number of cycles spent in a MiniportOidRequest handler, for an OID_GEN_RECIEVE_SCALE_PARAMETERS request that updates the RSS indirection table. Note that this counter can be unreliable if the CPU cycle counter is not synchronized across all CPUs on the system, since indirection table changes are run at PASSIVE_LEVEL and are not affinitized to a particular processor.

Packets Coalesced/sec: The result of OID_PACKET_COALESCING_FILTER_MATCH_COUNT.

Phew! We made it… now on to the next counter set!

The Physical Network Interface Card Activity counter set.

At present, this counter set is mostly useful for NIC drivers that implement NDIS Selective Suspend. In the future, we may add other counters to it, but right now it’s all power management.

Low Power Transitions (Lifetime) : The number of times that a network adapter has entered a low power state (D2 or D3). This can be as result of NDIS Selective Suspend, connected standby, system S3 or S4, or D3-on-disconnect. This counter is reset each time the device is halted (i.e., receives IRP_MN_REMOVE_DEVICE or the system powers off).

% Time Suspended (Lifetime) : The percent of time since the NIC was started that the NIC is in any low power state. For devices that implement NDIS Selective Suspend, this ought to be a highish percentage when the device isn’t under heavy use.

% Time Suspended (Instantaneous) : Similar to the above, except the time interval is equal to your sample interval. E.g., if you use the -SampleInterval parameter to the Get-Counter PowerShell cmdlet to set the sample interval to 5 seconds, then this counter returns what percent of the last 5 seconds that the device was in low power.

Device Power State: Returns the current D-state of the device. If the device is in D0, this counter returns 0. If the device is in D2, this counter returns 2. If the device is in D3, this counter returns… I bet you can guess. This one’s useful for keeping an eye on NDIS Selective Suspend. Toss it into a perfmon graph, and you can see exactly what the D-state is at each moment.

The RDMA Activity counter set.

The individual counters in this one all come from OID_NDK_STATISTICS, which is supplied by the NIC driver.

The Processor Information counter set.

This is not a networking-specific counter set, but it’s invaluable when diagnosing network performance problems. In particular, the % Interrupt Time, % DPC Time, and Interrupts/sec counters are quite useful. Note that Interrupts/sec is a system-wide counter that includes IPIs, so it can be difficult to infer what exactly causes an unexpectedly-large number of interrupts. Fortunately WPA lets you drill into the finest minutiae of any interrupt storm.

Honorable mentions

We’ve covered the counters that are tied to NDIS, but that’s hardly the end of the story. The counter sets listed below are likely of interest to anyone who’s looking at the Windows network stack. Don’t be afraid to rummage through all the counters that come with the OS – there’s some good stuff in there! For example, if you’re investigating a problem with TCP checksum offload, you’ll want to keep an eye on the TCP checksum errors counter.

IPv4 and IPv6
TCPv4 and TCPv6
UDPv4 and UDPv6
WFPv4 and WFPv6
TCPIP Performance Diagnostics and TCPIP Performance Diagnostics (Per-CPU)
Hyper-V Virtual Network Adapter
Hyper-V Virtual Network Adapter Drop Reasons
Hyper-V Virtual Network Adapter VRSS
Hyper-V Virtual Switch
Hyper-V Virtual Switch Port
Hyper-V Virtual Switch Processor
Network Virtualization

Enough! No more lists of counters! Next time we’ll have fewer words and more pictures as we visualize the flow of statistics and counters in Windows.