Yesterday’s post was a simple implementation of TransmitFile. First things first: I need to apologize. I was being a bit sneaky in this case. The code in the example was written to function correctly, and that mislead a lot of people who were trying to find a bug in the code, when there was no bug in the code. There was something wrong with the code, but it was a performance issue, not a correctness issue (although arguably a performance issue is a correctness issue).
The problem was that the implementation didn’t consider the interaction between sending data and the Nagle Algorithm/delayed acks. These algorithms exist to solve two halves of a basic problem with networking protocols.
Whenever you send data from one machine to another, the sender must wait until the receiver acknowledges the request before sending the next buffer (this is a rough approximation, ignoring sliding windows etc). So when you look at the data flow for a send, on the wire, it appears like this:
Client: Send frame 1 to Server
Server: Send acknowledgement for frame 1 to client
Client: Send frame 2 to Server
Server: Send acknowledgement for frame 2 to client
There is a common misconception that the amount of time it takes to send a message from one machine to another is related to the size of the packet. On some networks this is true (dialup comes to mind immediately). But one of the fundamental properties of an Ethernet (or token ring) network is that sending one byte of data takes essentially the same amount of time as sending 1568 bytes of data (the max Ethernet packet size). The critical factor is the number of packets transmitted, not the size of each packet.
As a result, if the client sends small buffers of data, the server has to acknowledge each of the buffers before the client can send the next buffer. The Nagle Algorithm as represented in RFC896 coalesces those small writes from the client into a larger buffer, so that the acknowledgement traffic isn’t as daunting. After 100-200ms of inactivity on the part of the client, the TCP implementation will flush the data out. As a result, if you’re doing small writes to a socket, if nagling is enabled, then you’ll see that your writes don’t appear on the wire until 100-200 milliseconds after the write. That wasn’t what was happening in this case, however.
Now consider what happens with a normal request/response type protocol (like POP3):
Client: Send “USER” to Server
Server: Send Ack for “USER” to Client
Server: Send “+OK” to Client
Client: Send Ack for “+OK” to Server
Client: Send “PASS” to Server
Server: Send Ack for “PASS” to Client
Server: Send “+OK” to Client
Client: Send Ack for “+OK” to Server
Client: Send “UIDL” to Server
Server: Send Ack for “UIDL”
Server: Send “+OK” to client
You’ll notice that there’s twice the number of frames being sent on the wire. And that each ack is immediately followed by a response from the server. Now remember that the acks are just as expensive as the data sends. It turns out that this behavior is known as the “Silly Window Syndrome”, or SWS. RFC1122 (the TCP specification) specifies a solution to SWS:
A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment.
So the receiver can (and should) delay acknowledging the send before it sends the ACK so it can piggyback the ack on the response. This behavior is also codified in section 4.2 of RFC2581, TCP Congestion Control. This “piggyback-ack” behavior isn’t unique to TCP/IP by the way; the Netbeui protocol has also had it, for about as long as TCP has.
Now let’s consider the buggy TransmitFile implementation. In that case, the client’s issuing multiple sends, but there’s no response. And since the sends are synchronous, each send can’t begin until after the previous send has completed. And again, because they’re synchronous sends, TCP can’t buffer the data to coalesce the writes. It has no choice but to wait for the acknowledgement before the send is completed.
Compounding this problem is the size of the transmit buffer: 4096 bytes. It turns out that the NT SWS avoidance algorithm delays an acknowledgement every other frame (details are here, on page 34). Once you remove the TCP overhead from the 1568 byte Ethernet packet size, the maximum payload size for TCP on Ethernet is 1460 bytes per frame. That means that a 4096 byte buffer takes 3 frames to transmit. That’s an odd number of frames, so, the receiving system defers the response for 200 milliseconds in the hope that the receiver will respond to the sender.
And that’s the crux of the bug. The buffer size used for writing the file data is will take an odd number of frames, which means that each write takes 200 milliseconds. Simply increasing the buffer size to 8192 bytes will speed it up by several orders of magnitude, because 8192 byte buffers take up an even number of frames (6). This is a hack solution though, a more correct solution uses overlapped I/O to keep the network adapter saturated.
I’ve found that whenever ANYONE says that their network I/O’s are taking too long, the problem is almost always related to one of these two problems (either nagling or delayed acks). And this isn’t a theoretical problem either. Over the last weekend, I decided to write up this series of articles, and when I came in on Monday morning, I was told that there was a performance problem with the Windows Media Connect HTTP server. My first thought was that it was problem with the layer reading data under the server, after that was ruled out, my next thought was that the problem was deferred acks. And after about 15 seconds of looking at an ethereal trace, I realized quickly that that was the problem. Fortunately, the fix was relatively simple, but the problem is VERY real.
This was not an isolated case – about every six months or so, someone posts a message on the internal performance analysis alias asking for help with a networking performance problem, the answer is almost always related to either nagling or delayed acks.
Kudos: This turned out to be harder than was expected. Nicholas Allen came up with the right answer, and Simon Cooke expanded on it quite nicely.
And for the unanticipated bugs:
Simon Cooke pointed out that the algorithm sends 0 byte writes if the file size is exactly a multiple of the buffer size.
Anon pointed out that there’s no check for null buffers. This is critical if there’s no header or trailer specified, the code would crash if the user didn’t specify them.
And, because sockets are opened for overlapped I/O, the synchronous WriteFile call to write to the socket only works if there are no other threads interacting with the socket. So the article should have stipulated that the app was single threaded.
During my research for this article, I ran into a rather nice write-up on SWS and this problem in this article on speeding up web servers.