Mainstream NUMA and the TCP/IP stack: Part I.

One of the intriguing aspects of the onset of the many-core processor era is the necessity of using parallel programming techniques to reap the performance benefits of this and future generations of processor chips. Instead of significantly faster processors, we are getting more of them packaged on a single chip. To build the cost-effective mid-range blade servers configured in huge server farms to drive today’s Internet-based applications, the hardware manufacturers are tying together these complex multiprocessor chips to create NUMA architecture machines. There is nothing the matter with NUMA – machines with non-uniform memory access speeds – of course, other than the fact that they introduce complex, hardware-specific programming models if you want to build applications that can harness their performance and capacity effectively. What is decidedly new is the extent to which previously esoteric NUMA architecture machines are becoming mainstream building blocks for current and future application servers. For the connected applications of the future, our ability to build programming models that help server application developers deal with complex NUMA architecture performance considerations is the singular challenge of the many-core era.

In this blog entry, I will discuss the way these two trends -- multi-core processors and mainstream NUMA architectures -- come together to influence the way high speed internetworking works today on servers of various sorts that need to handle a high volume of TCP/IP traffic. These include IIS web servers, Terminal Servers, SQL Servers, Exchange servers, Office Communicator servers, and others. Profound changes were necessary in the TCP/IP networking stack in both Windows Server 2008 and the Microsoft Windows Server 2003 Scalable Networking Pack release to scale effectively on multi-processor machines. These changes are associated with a technology known as Receive-Side Scaling, or RSS. RSS has serious performance implications for the architecture of highly scalable server applications that sit atop the TCP/IP stack in connected system environments.

Let’s start by considering what is happening to the TCP/IP software stack in Windows to support high speed networking, which is depicted in Figure 1.

Figure 1. The NDIS TCP/IP protocol stack in Windows.

The Internet Protocol (IP) and the Transmission Control Protocol (TCP) are the standardized software layers that sit atop the networking hardware. The Ethernet protocol is the pervasive Media Access Control (MAC) layer that segregates the transmission of digital bits into individual packets. Performance issues with Ethernet arise due to the relatively small size of each packet. The Maximum Transmission Unit (MTU) for standard Ethernet packets is 1500 bytes. Any message larger than the MTU requires segmentation to fit into standard-sized Ethernet packets. (Segmentation on the Send side and reassembly on the Receive side are functions performed by the next higher level protocol in the stack, namely the IP layer.) Not all transmissions are maximum-sized packets. For example, the Acknowledgement (ACK) packets required and frequently issued in TCP consist of roughly 50 bytes of packet headers only, with no additional data payload. On average, the size of packets circulating over the Internet is actually much less than the protocol-defined MTU.
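To make the segmentation arithmetic concrete, here is a minimal sketch in C (an illustration only, not part of any networking stack) that counts the Ethernet packets needed to carry a larger message. It assumes 20-byte IP and TCP headers with no options, which leaves 1460 bytes of payload per standard 1500-byte frame.

/* Illustration: how many standard Ethernet packets does a larger message need?
   Assumes a 1500-byte MTU and 20-byte IPv4 and TCP headers with no options. */
#include <stdio.h>

int main(void)
{
    const int mtu         = 1500;                     /* standard Ethernet MTU      */
    const int ip_hdr      = 20;                       /* IPv4 header, no options    */
    const int tcp_hdr     = 20;                       /* TCP header, no options     */
    const int payload_max = mtu - ip_hdr - tcp_hdr;   /* 1460 bytes of payload      */

    const long message = 64 * 1024;                   /* a hypothetical 64 KB send  */
    long packets = (message + payload_max - 1) / payload_max;   /* round up         */

    printf("A %ld-byte message requires %ld Ethernet packets (%d payload bytes each)\n",
           message, packets, payload_max);
    return 0;
}

Running this prints 45 packets for a 64 KB message, which gives a feel for how quickly packet counts add up on the Send side.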

The performance problems arise because, in a basic networking scheme, each packet received by the Network Interface Card (NIC) delivers a hardware interrupt to the host processor, requiring in turn some associated processing time on the host computer to service that interrupt. The TCP/IP protocol is reasonably complex, so the amount of host processing per interrupt is considerable.[1] As transmission bit rates have increased from 10 Mb to 100 Mb to 1 Gb to today's 10 Gb NICs, the potential interrupt rate has risen proportionally. The host CPU load associated with processing network interrupts is a long-standing issue in the world of high speed networking. The problem has taken on a new dimension in the many-core era because the network interface continues to get faster, but processor speeds are no longer keeping pace.

A back-of-the-envelope calculation to figure out how many interrupts/sec a host computer with a 10 Gb Ethernet card potentially needs to handle should illustrate the scope of the problem. The Ethernet wire protocol specifies a redundant coding scheme that transmits each eight bits of data as ten bits on the wire, commonly known as 8b/10b encoding. With this 10:8 overhead, a 10 Megabit Ethernet card has a nominal data rate of 1 Megabyte/sec, a 100 Mb Ethernet NIC transmits data at a 10 MB/sec rate, and so on. By the same arithmetic, a 10 Gb Ethernet card has the capacity to transmit application data at roughly 1 GB/sec.

To understand the rate at which interrupts need to be processed on a host computer to sustain 1 GB/sec of Ethernet throughput, simply divide by the average packet size. To keep the math easy, assume an average packet size of 1 KB or less. (This is not an outlandish assumption. A large portion of the Receive packets processed at a typical web server are TCP ACKs, which are headers-only packets of roughly 50 bytes. Meanwhile, HTTP GET requests containing a URL, a cookie value, and other optional parameters can usually fit in a single 1500-byte Ethernet packet -- in practice, the cookie data that most web applications store is often less than 1 KB.) Assuming an average packet size of 1 KB, a 10 Gb Ethernet card that can transfer data at a 1 GB/sec rate is capable of generating 1 million packet operations/sec on your networking server. Next, assume there is a 1:1 ratio of Send:Receive packets. If 50% of those operations are Receives, then the machine needs to be able to handle 500K receive interrupts/sec.

Now, suppose the number of instructions required to process each network interrupt -- in the device's Interrupt Service Routine (ISR), in the Deferred Procedure Call (DPC), and in the higher layers of the Network Device Interface Specification (NDIS) stack that support TCP/IP -- is, say, 10,000. The processor load to service TCP/IP networking requests is then:

500,000 interrupts/sec * 10,000 instructions/interrupt = 5,000,000,000 instructions/second (1)

which easily exceeds the capacity of a single CPU in the many-core era. If these network interrupts are confined to a single processor, which is the way things worked in days of yore, host processor speed is a bottleneck that will constrain the performance of a high speed NIC.
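The whole chain of assumptions behind equation (1) can be captured in a few lines of C. The figures below -- 1 KB average packets (treated as 1,000 bytes to keep the arithmetic round), a 1:1 Send:Receive mix, and a 10,000-instruction path length per interrupt -- are the same rough estimates used above, not measurements.

/* Back-of-the-envelope model of the interrupt load sketched in equation (1). */
#include <stdio.h>

int main(void)
{
    double line_rate_bps   = 10e9;                          /* 10 Gb Ethernet line rate      */
    double data_rate_Bps   = line_rate_bps * 8.0 / 10.0     /* 8b/10b: 10 line bits carry 8  */
                                           / 8.0;           /* bits -> bytes: ~1 GB/sec      */
    double avg_packet_size = 1000.0;                        /* assume ~1 KB per packet       */
    double packets_per_sec = data_rate_Bps / avg_packet_size;
    double receive_share   = 0.5;                           /* 1:1 Send:Receive ratio        */
    double interrupts_sec  = packets_per_sec * receive_share;
    double path_length     = 10000.0;                       /* assumed instructions/interrupt */

    printf("Data rate:        %.2f GB/sec\n", data_rate_Bps / 1e9);
    printf("Packets/sec:      %.0f\n", packets_per_sec);
    printf("Interrupts/sec:   %.0f\n", interrupts_sec);
    printf("Instructions/sec: %.2e\n", interrupts_sec * path_length);
    return 0;
}

The output reproduces the numbers in the text: roughly 1 GB/sec of data, 1 million packets/sec, 500,000 receive interrupts/sec, and a 5 billion instructions/sec processing budget.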

Of course, instead of wishing and hoping that TCP interrupt processing could be accomplished within 10K instructions in today's complex networking environment, it might help to actually try and measure the CPU path length associated with this processing. To measure the impact of the current TCP/IP stack in Windows Vista, I installed the NTttcp test tool available here and set up a simple test using the 1 Gb Ethernet NIC installed on a dual-core 2.2 GHz machine running Windows Vista SP1 over a dedicated Gigabit Ethernet network segment. Since the goal of the test was not to maximize network throughput, I specified 512-byte packets and was careful to confine the TCP interrupt processing to CPU 0 using the following NTttcp parameters:

ntttcpr -m 1,0,192.168.3.51 -a 16 -l 512 -mb -fr -t 120

I was also careful to shut down all other networking applications on my machine for the duration of the test.
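For readers who want to reproduce this kind of experiment, the sketch below shows one way a user-mode Windows program can confine its own receive processing to CPU 0 using SetThreadAffinityMask. This is an illustration of the general idea behind NTttcp's -m processor mapping switch, not NTttcp's actual source code, and it only pins the application thread; steering the NIC's interrupt and DPC work to a particular processor is a separate, driver-level matter.

/* Illustration only: pin the current (receive) thread to CPU 0. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Bit 0 of the affinity mask corresponds to CPU 0. */
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 0x1);
    if (previous == 0) {
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Receive thread now restricted to CPU 0 (previous mask: 0x%p)\n",
           (void *)previous);

    /* ... the socket receive loop would run here, with all of its
       user-mode cycles charged to CPU 0 ... */
    return 0;
}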

Here’s the output from a 120 second NTttcp run, allowing for both a warm-up and cool down period wrapped around the main test:

Throughput (KB/s)             16,475.553
Throughput (Mbit/s)           131.804
Average Frame Size            764.394
Packets Sent                  1,309,892
Packets Received              2,586,923
Packets Received/Interrupt    2
Interrupts/sec                9,494.04
Cycles/Byte                   129.3

On the dual-core machine, CPU 0 was maxed out at 100% for the duration of the test -- evidently, that was the capacity of the machine to receive TCP/IP packets and process and return the necessary Acknowledgement packets to the sender. I will drill into the CPU usage statistics in a moment. For now, let's focus on the interrupt rate, which was about 9,500 interrupts/sec, or slightly more than 100 μsecs of processing time per interrupt. This being a 2.2 GHz machine, 100 μsecs of processing time translates into roughly 220,000 cycles of execution time per TCP/IP interrupt. Substituting this more realistic estimate of the CPU path length (rounded down to 200,000 cycles) into equation 1 yields

500,000 interrupts/sec * 200,000 cycles/interrupt = 100,000,000,000 cycles/second (2)

a requirement for 100 GHz of host processing power to perform the TCP/IP processing for a 10 Gb Ethernet card running at its full rated capacity.
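The arithmetic behind equation (2) is worth spelling out. The sketch below uses the measured interrupt rate and the 2.2 GHz clock of the test machine to compute the per-interrupt cost, then scales the rounded 200,000-cycle figure up to the 500K interrupts/sec that a saturated 10 Gb link would generate.

/* Reproducing the arithmetic behind equation (2) from the measured NTttcp run:
   CPU 0 was saturated while handling roughly 9,500 interrupts/sec. */
#include <stdio.h>

int main(void)
{
    double interrupts_per_sec = 9494.04;    /* measured by NTttcp            */
    double cpu_hz             = 2.2e9;      /* 2.2 GHz test machine          */

    double usec_per_interrupt   = 1e6 / interrupts_per_sec;      /* ~105 usec */
    double cycles_per_interrupt = cpu_hz / interrupts_per_sec;   /* ~232,000  */

    /* Equation (2) rounds the per-interrupt cost down to 200,000 cycles. */
    double required_cycles_per_sec = 500000.0 * 200000.0;        /* 1.0e11    */

    printf("Per interrupt:  %.0f usec, %.0f cycles\n",
           usec_per_interrupt, cycles_per_interrupt);
    printf("At 500K interrupts/sec: %.1f GHz of CPU required\n",
           required_cycles_per_sec / 1e9);
    return 0;
}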

Next, I re-executed the test while running the xperf ETW utility that is packaged with the Windows Performance Toolkit to capture CPU consumption by the TCP/IP stack:

xperf -on LATENCY -f tcpreceive1.etl -ClockType Cycle

According to the xperf documentation, the LATENCY flag requests trace data that includes all CPU context switches, interrupts (Interrupt Service Routines, or ISRs), and Deferred Procedure Calls (DPCs). As explained in [1], Windows uses a two-step process to service device interrupts. Initially, the OS dispatches an ISR to service the specific device interrupt. During the ISR, further interrupts by the device are disabled. Ideally, the ISR performs the minimum amount of processing necessary to re-enable the device for interrupts and then schedules a DPC to finish the job. DPCs are dispatched at a lower priority than ISRs, but ahead of all other work in the machine. DPCs execute with the device re-enabled for interrupts, so it is possible for the execution of a DPC to be delayed because it is preempted by the need to service a higher priority interrupt from the NIC (or another device).
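For readers unfamiliar with the mechanism, here is a heavily abbreviated sketch of the ISR/DPC pattern using the WDM kernel routines KeInitializeDpc and KeInsertQueueDpc. It is not a complete or loadable driver -- the RX_CONTEXT structure and the device-acknowledgement step are placeholders standing in for hardware-specific code -- but it shows where the brief ISR ends and the deferred, interruptible DPC work begins.

/* Sketch of the two-step ISR/DPC pattern; assumes WDK headers. */
#include <ntddk.h>

typedef struct _RX_CONTEXT {      /* hypothetical per-device state */
    KDPC  RxDpc;
    ULONG PacketsPending;
} RX_CONTEXT;

/* DPC: runs later, with device interrupts re-enabled, and does the bulk of
   the receive work (draining descriptors, indicating packets up the stack). */
VOID RxDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);
    RX_CONTEXT *rx = (RX_CONTEXT *)Context;
    /* ... process the pending received packets here ... */
    rx->PacketsPending = 0;
}

/* ISR: runs with the device's interrupts masked; does as little as possible,
   then queues the DPC to finish the job. */
BOOLEAN RxIsr(PKINTERRUPT Interrupt, PVOID ServiceContext)
{
    UNREFERENCED_PARAMETER(Interrupt);
    RX_CONTEXT *rx = (RX_CONTEXT *)ServiceContext;
    /* ... read and acknowledge the NIC's interrupt status register ... */
    KeInsertQueueDpc(&rx->RxDpc, NULL, NULL);   /* defer the heavy lifting */
    return TRUE;                                /* the interrupt was ours  */
}

/* During device initialization the DPC object is set up once: */
VOID RxInit(RX_CONTEXT *rx)
{
    KeInitializeDpc(&rx->RxDpc, RxDpcRoutine, rx);
    rx->PacketsPending = 0;
}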

Gathering the xperf data while the NTttcp test was running lowered the network throughput only slightly -- by less than 2%, a measurement perturbation that can safely be ignored in this context. The kernel trace events requested are basically being gathered continuously by the diagnostic infrastructure in Windows anyway. The xperf session merely gathers them from memory and writes them to disk. The disk was otherwise not being used for anything else during this test, and there was an idle CPU available to handle the tracing chores. Overall, gathering the trace data was minimally disruptive in this situation.

I then loaded the trace data from the .etl file and used the xperfview GUI application to analyze it. See Figure 2.

Figure 2. xperfview display showing % CPU utilization, % DPC time, and % Interrupt time, calculated from the kernel trace event data recorded during the NTttcp test execution.

Figure 2 shows three views of the activity on CPU 0 where all the networking processing was performed. The top view shows overall processor utilization at close to 100% during the TCP test, with an overlay of a second line graph indicating the portion specifically associated with DPC processing, accounting for somewhere in excess of 60% busy. The DPC data is broken out and displayed separately in the middle graph, and the Interrupt CPU time is shown at the bottom (a little less than 4%).

xperfview allows you to display a Summary Table that breaks out Interrupt and DPC processor utilization at the driver level, sorted by the amount of processor time spent per module. For the DPCs, we see the following.

Module        Function      Count     Max Duration [ms]   Avg Duration [ms]   Duration [ms]
ndis.sys                    425125    1.116595            0.075376            32044.44967
              0x8ac79237    423800    0.797987            0.075552            32019.18583
              0x8ad38209    207       1.116595            0.11752             24.326752
              0x8ad3892f    1117      0.012439            0.000837            0.935867
              0x8ad399b3    1         0.001213            0.001213            0.001213
USBPORT.SYS                 8312      0.064506            0.011802            98.100399
tcpip.sys                   4154      0.551394            0.009585            39.817004
dxgkrnl.sys   0x8f34e09b    3039      0.528346            0.012848            39.047187
iastor.sys                  1221      0.033546            0.015061            18.390545

Table 1. xperfview Summary Table display showing processor utilization by DPC.

Confirming my back-of-the-envelope calculation presented earlier, xperf trace data indicates the average duration of an ndis.sys DPC used to process a network interrupt was 75 μsecs. The total amount of time spent in DPC processing was approximately 32 seconds of the full trace, which lasted about 52 seconds, corresponding to slightly more than 61% busy on CPU 0.
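As a quick cross-check, the ndis.sys row of Table 1 reproduces both the per-DPC cost and the DPC busy percentage on CPU 0, using the roughly 52-second trace duration mentioned above.

/* Cross-checking the ndis.sys DPC figures reported in Table 1. */
#include <stdio.h>

int main(void)
{
    double dpc_total_ms  = 32044.44967;   /* ndis.sys Duration from Table 1  */
    double dpc_count     = 425125.0;      /* ndis.sys Count from Table 1     */
    double trace_len_sec = 52.0;          /* approximate trace duration      */

    printf("Avg DPC:  %.1f usec\n", dpc_total_ms / dpc_count * 1000.0);
    printf("DPC busy: %.1f%% of one CPU\n",
           (dpc_total_ms / 1000.0) / trace_len_sec * 100.0);
    return 0;
}

This prints an average DPC cost of about 75 μsecs and a DPC busy figure of about 61.6%, matching the graphs in Figure 2.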

Module        Function      Count     Max Duration [ms]   Avg Duration [ms]   Duration [ms]
ntkrnlpa.exe  0x828d6fa2    423803    0.023875            0.003173            1345.147547
dxgkrnl.sys   0x8f3630ea    3040      0.096759            0.039523            120.151742
USBPORT.SYS   0x8f4098c2    5529      0.025199            0.007241            40.038229
pcmcia.sys    0x82f8deea    4968      0.02401             0.006754            33.558272
iastor.sys    0x8aaa7f6c    4968      0.016345            0.005103            25.353482

Table 2. xperfview Summary Table display showing processor utilization by ISR.

The Summary Table display reproduced in Table 2 serves to confirm a direct relationship between the ndis DPC processing and the kernel mode interrupts processed by the ntkrnlpa ISR. The average duration of an ntkrnlpa ISR execution was just 3 μsecs. Together, the ISR+DPC time was just under 80 μsecs. This leads to a slight downward revision of equation 2:

500,000 interrupts/sec * 175,000 cycles/interrupt = 87,500,000,000 cycles/second (3)

which remains a formidable constraint, considering the speed of current processors.
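Putting the measured ISR and DPC averages together gives the revised cycle budget in equation (3). The sketch below uses the exact averages from Tables 1 and 2; the text above rounds the result up to 175,000 cycles per interrupt.

/* The revision behind equation (3): measured DPC + ISR time per interrupt,
   converted to cycles on the 2.2 GHz test machine and scaled to 500K
   interrupts/sec. */
#include <stdio.h>

int main(void)
{
    double dpc_usec = 75.376;   /* average ndis.sys DPC, Table 1    */
    double isr_usec = 3.173;    /* average ntkrnlpa ISR, Table 2    */
    double cpu_hz   = 2.2e9;    /* 2.2 GHz processor                */

    double cycles_per_interrupt = (dpc_usec + isr_usec) * 1e-6 * cpu_hz;
    double cycles_per_sec       = 500000.0 * cycles_per_interrupt;

    printf("Cycles per interrupt: %.0f (equation (3) rounds this to 175,000)\n",
           cycles_per_interrupt);
    printf("Required capacity:    %.1f GHz\n", cycles_per_sec / 1e9);
    return 0;
}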

Finally, I drilled into processor utilization by process, which showed that the NTttcp process -- whose main processing thread was also affinitized to CPU 0, at the receiving end of the interrupts -- was responsible for an additional 11% CPU busy. Allowing for OS scheduling and other overhead factors, these three workloads -- ISR, DPC, and the NTttcp receive thread -- account for the 100% utilization of CPU 0.

Process                 Cpu Usage (ms)   % Cpu Usage
Idle (0)                32730.11165      50.4
NTttcpr.exe (3280)      7120.518872      10.97
services.exe (700)      582.050622       0.9
InoRT.exe (2076)        368.519663       0.57
dwm.exe (4428)          300.59808        0.46
System (4)              256.802505       0.4
svchost.exe (1100)      166.479679       0.26
taskmgr.exe (7092)      162.960941       0.25
msiexec.exe (6412)      145.122854       0.22
sidebar.exe (5884)      111.86257        0.17
WmiPrvSE.exe (7952)     93.797334        0.14
csrss.exe (668)         79.331035        0.12
svchost.exe (1852)      70.65799         0.11

Table 3. xperfview Summary Table display showing processor utilization by process.

Note that the NTttcp program responsible for processing the packets it receives probably represents a network application that performs the minimum amount of application-specific processing per packet that you can expect. It ignores the data payload contents completely, and its other processing of each packet is pretty much confined to maintaining its throughput statistics. We should also note that it is a user mode application, which means that processing a received packet requires a transition from kernel mode to user mode. It is possible to implement a kernel mode networking application in Windows -- the http.sys kernel mode driver that IIS uses is one -- that avoids these expensive processor execution state transitions, but such applications are the exception, not the rule. (And, when it comes to building HTTP Response messages dynamically using ASP.NET, even http.sys hands off the HTTP Request packet to an ASP.NET user mode thread for processing.)

The point of this set of measurements and calculations is not to characterize the network traffic in and out of a typical web server, but to understand the motivation for recent architectural changes in the networking stack -- both hardware and software -- that allow network interrupts to be processed concurrently on multiple processors. Those architectural changes are the subject of Part II of this blog.

Part II of this article is posted here.

[1] Some have argued otherwise. See D. D. Clark, V. Jacobson, J. Romkey, and H. Salwen, "An analysis of TCP processing overhead," IEEE Communications Magazine, 27(6):23-29, June 1989. This assessment was made prior to the overhaul of the TCP protocol proposed by Van Jacobson that was implemented to address serious scalability issues that Internet technology faced in the early years of its adoption. Taking account of both security and performance considerations, the TCP/IP protocol software stack as implemented today is considerably more complex. Microsoft Windows Server 2003 TCP/IP Protocols and Services Technical Reference by Davies and Lee is a useful guide to the full set of TCP/IP services that are provided today, except that it does not cover the additional functions in the Microsoft Windows Server 2003 Scalable Networking Pack release discussed here. For a more recent description of TCP/IP host processing overhead, see Hyun-Wook Jin and Chuck Yoo, "Impact of protocol overheads on network throughput over high-speed interconnects: measurement, analysis, and improvement," The Journal of Supercomputing, 41(1):17-40, 2007.