The Windows Vista TCP/IP stack is substantially more efficient than its predecessors and takes full advantage of hardware advances such as gigabit networking. As Murari explained in a previous posting (Advances in Windows TCP/IP Networking), a number of bottlenecks affect TCP throughput. Here, I will give some examples of how we’ve addressed these bottlenecks in the Windows Vista TCP/IP stack.
TCP auto-tuning: At any given time, the amount of data that TCP can send is governed by three factors: the congestion window, the receive window and the number of bytes available to send. Without TCP window scaling (which was disabled by default in previous versions of Windows), the largest receive window a receiver can advertise is 64 KB (65,535 bytes). Since the congestion window is usually greater than 64 KB in high-bandwidth/high-latency networks, the receive window is often the limiting factor, provided the application is submitting enough data.
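To make the three-way limit concrete, here is a minimal Python sketch. The function names are my own for illustration and are not part of any Windows API; the only hard fact it relies on is that the unscaled TCP window field is 16 bits wide.

```python
# Largest value that fits in the 16-bit TCP window field (no window scaling).
MAX_UNSCALED_WINDOW = 65_535


def sendable_bytes(cwnd: int, rwnd: int, bytes_queued: int) -> int:
    """Bytes TCP may send: the smallest of the three limits named above."""
    return min(cwnd, rwnd, bytes_queued)


def window_limited_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bits/s: at most one full window per round trip."""
    return window_bytes * 8 / rtt_seconds


# A 256 KB congestion window and plenty of queued data: the receive
# window is the binding limit.
limit = sendable_bytes(256 * 1024, MAX_UNSCALED_WINDOW, 1_000_000)

# With a 100 ms round trip, that window caps throughput at roughly
# 5.2 Mbit/s, no matter how fast the link is.
cap_bps = window_limited_throughput(limit, 0.100)
```

The second function is why the 64 KB ceiling hurts on long paths: the cap depends only on the window and the round-trip time, not on the link speed.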
In previous versions of Windows, users could work around this problem by setting the TcpWindowSize registry value. However, TcpWindowSize is a global setting applied to all connections, and it’s often hard for users to know the appropriate window size to set.
To address this issue in Windows Vista, we implemented TCP auto-tuning. It enables TCP window scaling by default and automatically tunes the TCP receive window size based on the bandwidth-delay product (BDP) and the rate at which the application reads data from the connection. With TCP auto-tuning, we have seen 10x throughput improvements in internal testing over underutilized wide-area network links.
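For readers who have not worked with the bandwidth-delay product, the arithmetic can be sketched as follows. This is not the auto-tuning algorithm itself, which also accounts for how fast the application drains the connection; it only shows how large a window a path needs and which window-scale shift would cover it. The function names and the example link are assumptions for illustration; the cap of 14 on the shift comes from the TCP window scale option.

```python
MAX_UNSCALED_WINDOW = 65_535   # 16-bit TCP window field
MAX_WINDOW_SHIFT = 14          # upper limit of the TCP window scale option


def bandwidth_delay_product(bandwidth_bps: float, rtt_seconds: float) -> int:
    """Bytes that must be in flight to keep the path full."""
    return int(bandwidth_bps * rtt_seconds / 8)


def window_scale_shift(target_window: int) -> int:
    """Smallest shift whose scaled maximum window covers target_window."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < target_window and shift < MAX_WINDOW_SHIFT:
        shift += 1
    return shift


# Hypothetical path: 1 Gbit/s with a 100 ms round-trip time.
bdp = bandwidth_delay_product(1_000_000_000, 0.100)   # about 12.5 million bytes
shift = window_scale_shift(bdp)                       # scale factor of 2**shift
```

On this example path the receive window would have to be nearly 200 times larger than the unscaled maximum to keep the link busy, which is the gap that window scaling and auto-tuning are meant to close.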
Receive-Side Scaling: Networking stacks face a number of challenges in scaling their receive processing across processors on multi-processor systems. For instance, on previous versions of Windows all packets indicated in a single interrupt service routine (ISR) are typically processed in a single deferred procedure call (DPC) queued to a specific processor to avoid packet reordering. Until the outstanding DPC completes, no more receive indication interrupts can be triggered. As a result, only one processor can be used at any given time for processing received packets for a single network adapter.
Receive-side scaling (RSS) is our solution for this issue in the new networking stack: it enables parallelized processing of received packets on multiple processors, while avoiding packet reordering. It achieves parallelism by allowing ISRs to queue DPCs on multiple processors, enabling packet processing on multiple processors at the same time. It avoids packet reordering by separating packets into flows, and using a single processor for processing all the packets for a given flow. Packets are separated into flows by computing a hash value based on specific fields in each packet, and the resulting hash values are used to select a processor for processing the flow. Using TCP as an example, this approach ensures that all packets belonging to a given TCP connection will be queued to the same processor, in the same order that they were received by the network adapter.
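The flow-to-processor mapping can be illustrated with a short Python sketch. Note the assumptions: CRC-32 is used here purely as a stand-in for the hash function that RSS-capable hardware actually computes, the processor count is arbitrary, and packets are modeled as simple tuples. None of the names below are NDIS or driver APIs.

```python
import socket
import struct
import zlib

NUM_PROCESSORS = 4  # hypothetical processor count


def flow_hash(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> int:
    """Hash the fields that identify a TCP connection (CRC-32 as a stand-in)."""
    key = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
           + struct.pack("!HH", src_port, dst_port))
    return zlib.crc32(key)


def select_processor(src_ip, src_port, dst_ip, dst_port,
                     num_processors=NUM_PROCESSORS) -> int:
    """Every packet of a given flow maps to the same processor index."""
    return flow_hash(src_ip, src_port, dst_ip, dst_port) % num_processors


def dispatch(packets, num_processors=NUM_PROCESSORS):
    """Place packets on per-processor queues, preserving arrival order.

    Each packet is a tuple (src_ip, src_port, dst_ip, dst_port, seq).
    """
    queues = [[] for _ in range(num_processors)]
    for pkt in packets:
        cpu = select_processor(*pkt[:4], num_processors)
        queues[cpu].append(pkt)
    return queues


# Two interleaved flows: each flow stays on one queue, in arrival order.
arrivals = [
    ("10.0.0.1", 50000, "10.0.0.9", 80, 1),
    ("10.0.0.2", 50001, "10.0.0.9", 80, 1),
    ("10.0.0.1", 50000, "10.0.0.9", 80, 2),
    ("10.0.0.2", 50001, "10.0.0.9", 80, 2),
]
queues = dispatch(arrivals)
```

Because the processor is a pure function of the connection's addresses and ports, two packets of the same connection can never be handled on different processors, which is what rules out reordering within a flow.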
TCP offload: Previous Windows releases already support network task offload for stateless per-packet operations such as large send offload (LSO) and checksum offload. In Windows Vista, in addition to the offloads supported on previous Windows releases, we’ve also introduced support for TCP chimney offload. TCP chimney offload enables Windows to offload all TCP processing for a connection to a network adapter. Offloads are initiated on a per-connection basis, based on heuristics. Compared to task offload, TCP chimney offload further reduces networking-related CPU overhead, enabling better overall system performance by freeing up the CPU for other tasks.
We have also responded to customer feedback by making the Windows Vista TCP/IP stack much smarter and more adaptive in a number of scenarios. One such improvement we’ve made is to enable TCP black-hole detection by default in Windows Vista.
Historically, problems caused by black-hole routers have been among the leading sources of product support calls for previous Windows networking stacks. To understand why, it’s important to know that TCP/IP relies on ICMP packet-too-big error messages to discover the maximum transmission unit (MTU) for any given connection’s path, so that it can reduce the size of the packets that it sends if they’re too large. If a router along the path does not send back ICMP error messages, or if a firewall drops ICMP error messages, TCP will never find out that its packets are too big. As a result, it will retransmit the packets repeatedly at the same size, up to its maximum number of retransmissions, and when it gets no response it will terminate the connection.
Black-hole router detection is a mechanism used in this scenario to automatically reduce the size of the packets sent for a connection, based on the current status of the connection, in the absence of feedback from ICMP packet-too-big error messages. This mechanism was disabled by default in previous versions of Windows, because previous approaches would often yield too many false positives, lowering the packet size unnecessarily and reducing performance. In Windows Vista, our improvements have reduced the likelihood of false positives and, consequently, minimized the adverse performance impact, enabling us to turn on black-hole detection by default in the upcoming Beta 2 release.
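The general idea can be sketched in a few lines of Python. This is a toy model, not the Windows Vista logic: the retransmission threshold, the retransmission limit and the fallback segment size are all assumptions chosen for illustration (536 bytes is the classic default TCP segment size, corresponding to a 576-byte IPv4 datagram), and the path is modeled as silently discarding anything larger than it can carry.

```python
from dataclasses import dataclass

FALLBACK_MSS = 536         # assumption: 576-byte IPv4 datagram minus 40 header bytes
BLACKHOLE_THRESHOLD = 2    # assumption: timeouts before suspecting a black hole
MAX_RETRANSMISSIONS = 5    # assumption: timeouts tolerated before giving up


@dataclass
class Connection:
    mss: int = 1460
    retransmissions: int = 0
    closed: bool = False


def on_retransmit_timeout(conn: Connection, detect_black_hole: bool) -> None:
    """Handle one timeout: give up, or shrink the segment size and retry."""
    conn.retransmissions += 1
    if conn.retransmissions > MAX_RETRANSMISSIONS:
        conn.closed = True
        return
    if (detect_black_hole
            and conn.retransmissions >= BLACKHOLE_THRESHOLD
            and conn.mss > FALLBACK_MSS):
        conn.mss = FALLBACK_MSS


def on_ack(conn: Connection) -> None:
    """An acknowledgment arrived, so the path is delivering segments."""
    conn.retransmissions = 0


def send_segment(conn: Connection, path_mss: int, detect_black_hole: bool) -> bool:
    """Send over a path that silently drops segments larger than path_mss."""
    while not conn.closed:
        if conn.mss <= path_mss:
            on_ack(conn)
            return True
        on_retransmit_timeout(conn, detect_black_hole)
    return False


# Path carries at most 1000-byte segments and never reports an error.
with_detection = Connection()
delivered = send_segment(with_detection, path_mss=1000, detect_black_hole=True)

without_detection = Connection()
stalled = send_segment(without_detection, path_mss=1000, detect_black_hole=False)
```

In this model the connection with detection recovers after a couple of timeouts by falling back to the smaller segment size, while the one without it exhausts its retransmissions and closes. The false-positive problem mentioned above shows up here too: ordinary congestion loss would also trip the threshold, which is why a real implementation needs more evidence than a simple timeout count.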
There are many, many more innovations that we’ve made in the network stack, far more than I can write about in this one posting. Stay tuned for more…
Software Development Engineer, TCP/IP Networking