The consequences of ignoring nagling and delayed acks

Over most of this week, I’ve discussed how ignoring the underlying network architecture can radically hurt an application.  Now it’s time for a war story about how things can go awry if you don’t pay attention to those details.

One of the basic limitations of networking is that you really shouldn’t send a server more data than it is expecting.  At a minimum, it’s likely that your connection will be flow-controlled.  This is especially important when you’re dealing with NetBIOS semantics.  Unlike stream-based transports (like TCP), NetBIOS has message-based semantics.  This means that if the transmitter sends a buffer that is larger than the buffer the receiver is prepared to accept, the send will fail.  As a result, the SMB protocol has the concept of a “negotiated buffer size”.  Typically this buffer size is about 4K.
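
To make that concrete, here is a minimal sketch in C of what message-based semantics mean for a sender.  Everything in it (the session structure, max_buffer_size, netbios_send) is a hypothetical stand-in for the real NetBIOS/SMB client plumbing; the point is just that an over-large send is rejected rather than streamed, so the client has to respect the negotiated buffer size.

    /* Sketch only: struct smb_session, max_buffer_size, and netbios_send()
     * are hypothetical stand-ins for the real NetBIOS/SMB client plumbing. */
    #include <stddef.h>

    struct smb_session {
        size_t max_buffer_size;    /* negotiated at session setup, typically ~4K */
    };

    /* Hypothetical stub: hands one message to the transport. */
    static int netbios_send(struct smb_session *s, const void *data, size_t length)
    {
        (void)s; (void)data; (void)length;
        return 0;
    }

    int smb_send(struct smb_session *s, const void *data, size_t length)
    {
        /* Message semantics: the receiver has only posted max_buffer_size
         * bytes, so a larger send fails outright instead of being streamed. */
        if (length > s->max_buffer_size)
            return -1;             /* caller must split the data into smaller blocks */

        return netbios_send(s, data, length);
    }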

LAN Manager 1.0 had a bunch of really cool new enhancements to the basic SMB protocol.  One of the neatest (which was used for its IPC mechanism) was the “transaction” SMB.  The idea behind the transaction SMB was to enable application-driven large (up to 64K :) ) transactions.  The protocol flow for the transaction SMB went roughly like this:

            Client:  Request Transaction, sending <n> bytes, receiving <m> bytes
            Server: Ok, buffers allocated to receive <n> bytes, go ahead
            Client: Sending 4K block 1
            Client: Sending 4K block 2
            Client: Sending 4K block <n>
            <The server does its thing and responds>
            Server: Response 4K block 1
            Server: Response 4K block 2
            Server: Response 4K block 3
            Server: Response 4K block <n>

The idea was that the client would “shotgun” the sends to the server asynchronously and as quickly as possible, and the server would do the same with its response.  The thinking was that if the transmitter had lots of outstanding sends, the transport would deliver the data as quickly as possible.
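
In code, the “shotgun” strategy looks roughly like the sketch below.  The names post_async_send and wait_for_all_sends are hypothetical stand-ins for the asynchronous send machinery of the day; the point is that every block is queued up front, before any completion comes back.

    /* Sketch of the "shotgun" send strategy; post_async_send() and
     * wait_for_all_sends() are hypothetical stubs for the real async plumbing. */
    #include <stddef.h>

    #define NEGOTIATED_BUFFER_SIZE 4096

    /* Hypothetical stub: queues one block and returns immediately. */
    static int post_async_send(const char *data, size_t length)
    {
        (void)data; (void)length;
        return 0;
    }

    /* Hypothetical stub: blocks until every queued send has completed. */
    static int wait_for_all_sends(void)
    {
        return 0;
    }

    int transaction_send(const char *payload, size_t length)
    {
        /* Queue every block without waiting for the previous one to complete,
         * hoping a deep queue of outstanding sends keeps the wire full. */
        while (length > 0) {
            size_t chunk = (length < NEGOTIATED_BUFFER_SIZE)
                               ? length : NEGOTIATED_BUFFER_SIZE;
            int status = post_async_send(payload, chunk);
            if (status != 0)
                return status;
            payload += chunk;
            length  -= chunk;
        }
        return wait_for_all_sends();
    }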

It looks good at this level.  But if you were following the discussion in the earlier posts, you should have some red flags raised by now.  Now let’s consider what happens to this exchange at the network layer:

            Client:  Request Transaction, sending <n> bytes, receiving <m> bytes
            Server: ACK Request
            Server: Ok, buffers allocated to receive <n> bytes, go ahead
            Client: ACK Request
            Client: Sending 4K block 1, frame 1
            Client: Sending 4K block 1, frame 2
            Client: Sending 4K block 1, frame 3
            Server: ACK Request
            Client: Sending 4K block 2, frame 1
            Client: Sending 4K block 2, frame 2
            Client: Sending 4K block 2, frame 3
            Server: ACK Request
            Client: Sending 4K block <n>, frame 1
            Client: Sending 4K block <n>, frame 2
            Client: Sending 4K block <n>, frame 3
            Server: ACK Request
            <The server does its thing and responds>
            Server: Sending 4K block 1, frame 1
            Server: Sending 4K block 1, frame 2
            Server: Sending 4K block 1, frame 3
            Client: ACK Request
            Server: Sending 4K block 2, frame 1
            Server: Sending 4K block 2, frame 2
            Server: Sending 4K block 2, frame 3
            Client: ACK Request
            Server: Sending 4K block 3, frame 1
            Server: Sending 4K block 3, frame 2
            Server: Sending 4K block 3, frame 3
            Client: ACK Request
            Server: Sending 4K block <n>, frame 1
            Server: Sending 4K block <n>, frame 2
            Server: Sending 4K block <n>, frame 3
            Client: ACK Request

Well, that’s a lot more traffic, but nothing outrageous.  On the other hand, the idea that multiple async sends would fill the pipeline clearly went away: the second send doesn’t start until the first is acknowledged.  In addition, the sliding window never gets larger than a single 4K block, but that’s not the end of the world…

But see what happens when we add in delayed (or piggybacked) acks to the picture…  Remember, CIFS uses NetBIOS semantics.  That means that every byte of every send must be acknowledged before the next block can be sent.
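
Before looking at the trace, here is roughly what the receiving side’s delayed-ack logic is doing.  This is a simplified sketch, not any particular stack’s code: the types, helper names, and the 200ms constant are illustrative, but the decision it models (piggyback the ACK if there’s outgoing data, otherwise hold it until a timer fires) is the behavior that shows up below.

    /* Simplified sketch of receiver-side delayed acks; the types, helpers,
     * and 200ms timer value are illustrative, not any real stack's code. */
    #include <stdbool.h>

    #define DELAYED_ACK_TIMEOUT_MS 200

    struct connection {
        bool has_data_to_send;     /* is there outgoing data to piggyback the ACK on? */
    };

    static void send_segment_with_ack(struct connection *c) { (void)c; /* data + ACK */ }
    static void send_bare_ack(struct connection *c)         { (void)c; /* ACK only  */ }
    static void start_delayed_ack_timer(struct connection *c, int ms) { (void)c; (void)ms; }

    /* Called when a data segment arrives from the peer. */
    void on_segment_received(struct connection *c)
    {
        if (c->has_data_to_send) {
            /* Piggyback the ACK on outgoing data: essentially free. */
            send_segment_with_ack(c);
        } else {
            /* Nothing to send yet: hold the ACK in the hope that the
             * application produces a response to carry it. */
            start_delayed_ack_timer(c, DELAYED_ACK_TIMEOUT_MS);
        }
    }

    /* Fires when no outgoing data appeared in time; this is the 200ms wait
     * on every block in the trace below. */
    void on_delayed_ack_timer(struct connection *c)
    {
        send_bare_ack(c);
    }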

            Client:  Request Transaction, sending <n> bytes, receiving <m> bytes
            Server: ACK Request
            Server: Ok, buffers allocated to receive <n> bytes, go ahead
            Client: ACK Request
            Client: Sending 4K block 1, frame 1
            Client: Sending 4K block 1, frame 2
            Client: Sending 4K block 1, frame 3
            Server: wait 200ms for server response and ACK Request
            Client: Sending 4K block 2, frame 1
            Client: Sending 4K block 2, frame 2
            Client: Sending 4K block 2, frame 3
            Server: wait 200ms for server response and ACK Request
            Client: Sending 4K block <n>, frame 1
            Client: Sending 4K block <n>, frame 2
            Client: Sending 4K block <n>, frame 3
            Server: wait 200ms for server response and ACK Request
            <The server does its thing and responds>
            Server: Sending 4K block 1, frame 1
            Server: Sending 4K block 1, frame 2
            Server: Sending 4K block 1, frame 3
            Client: wait 200ms for client request and ACK Request
            Server: Sending 4K block 2, frame 1
            Server: Sending 4K block 2, frame 2
            Server: Sending 4K block 2, frame 3
            Client: wait 200ms for client request and ACK Request
            Server: Sending 4K block 3, frame 1
            Server: Sending 4K block 3, frame 2
            Server: Sending 4K block 3, frame 3
            Client: wait 200ms for client request and ACK Request
            Server: Sending 4K block <n>, frame 1
            Server: Sending 4K block <n>, frame 2
            Server: Sending 4K block <n>, frame 3
            Client: wait 200ms for client request and ACK Request

All of a sudden, an operation that looked really good in the high-level protocol overview turned into an absolute nightmare on the wire.  It would take over a second just to send and receive 28K of data!
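
To put a rough number on that claim: 28K is seven 4K blocks, and with NetBIOS message semantics the next block can’t go out until the previous one is fully acknowledged, so each block pays for the full delayed-ack timer.  The little model below uses the 4K negotiated buffer and the typical 200ms timer from the text and ignores actual transmission time.

    /* Back-of-the-envelope model: one full delayed-ack stall per 4K block,
     * transmission time ignored.  4K and 200ms follow the text; real stacks
     * and networks will vary. */
    #include <stdio.h>

    int main(void)
    {
        const int block_size     = 4 * 1024;   /* negotiated buffer size */
        const int delayed_ack_ms = 200;        /* typical delayed-ack timer */
        const int total_bytes    = 28 * 1024;  /* data sent plus received */

        int blocks   = total_bytes / block_size;   /* 7 blocks */
        int stall_ms = blocks * delayed_ack_ms;    /* 1400 ms */

        printf("%d blocks x %d ms = %d ms of delayed-ack stalls alone\n",
               blocks, delayed_ack_ms, stall_ms);
        return 0;
    }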

This is the consequence of not understanding how the lower levels behave when you design higher level protocols.  If you don’t know what’s going to happen on the wire, design decisions that look good at a high level turn out to be terrible when put into practice.  In many ways, this is another example of Joel Spolsky’s Law of Leaky Abstractions – the network layer abstraction leaked all the way up to the application layer.

My solution to this problem when we first encountered it (back before NT 3.1 shipped) was to add the TDI_SEND_NO_RESPONSE_EXPECTED flag to the TdiBuildSend API, which instructs the transport that no response is expected for the request.  The transport can then disable delayed acks for the request (if possible).  For some transports it’s not possible to disable piggybacked acks, but for those that can, this is a huge optimization.
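
For what it’s worth, a rough kernel-mode sketch of how a TDI client might pass that flag is below.  It assumes the IRP, transport device object, connection file object, MDL, and completion routine have already been set up (that plumbing is omitted), and the wrapper name IssueOneWaySend is mine; only TdiBuildSend, TDI_SEND_NO_RESPONSE_EXPECTED, and IoCallDriver come from the real interfaces.

    /* Kernel-mode sketch, not a complete driver: the surrounding setup of
     * the IRP, connection file object, and MDL is assumed to exist elsewhere. */
    #include <ntddk.h>
    #include <tdikrnl.h>

    NTSTATUS
    IssueOneWaySend(
        PIRP Irp,
        PDEVICE_OBJECT TransportDevice,
        PFILE_OBJECT ConnectionFileObject,
        PIO_COMPLETION_ROUTINE CompletionRoutine,
        PVOID CompletionContext,
        PMDL SendMdl,
        ULONG SendLength
        )
    {
        /* Tell the transport no response is expected for this send, so a
         * transport that supports it doesn't hold its ACK waiting for
         * piggyback data. */
        TdiBuildSend(Irp,
                     TransportDevice,
                     ConnectionFileObject,
                     CompletionRoutine,
                     CompletionContext,
                     SendMdl,
                     TDI_SEND_NO_RESPONSE_EXPECTED,
                     SendLength);

        return IoCallDriver(TransportDevice, Irp);
    }

As the text says, the flag is only a hint: a transport that can’t suppress piggybacked acks simply ignores it.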