In the first part we talked briefly about KITL timeouts. In this article we are going to give you a little bit more details and try to explain why it behaves the way it does.
How did we run into this issue?
It all started with Hopper (apparently Hopper is also good at finding KITL problems as well as all the other problems it usually finds :-). When we ran Hopper on one of the devices, it would behave in a very strange way. Sometimes it would start spewing a lot of data into the debug output and stop completely, the other times it would keep going, sometimes it would stop for 20 minutes and then continue like nothing happened. And when debug output stopped we could still type in Shell window and get the response back from the Shell commands. Most of the time, however, we would see a small hick up (sometimes less then a second) and debugging would continue after that. As you see it was completely random. Obviously if the only thing we saw were hick ups we would have never even noticed the problem.
Everyone knows how frustrating these ambiguous bugs could be when you don’t have a consistent repro. We got lucky because we saw that it would happen when Hopper tried to print out a 27 character string.
Why 27 characters? As it was mentioned in the previous post the way USB Serial KITL indicates the end of the transmission is by sending a partial packet (less then max packet size). However, if the data sent over the wire happens to be a multiple of max packet size then a zero length packet needs to be sent to indicate the end of the transmission. The default packet size that most manufactures use right now is 64 bytes. When you send 27 character text you will get exactly 64 bytes. Let’s look at the debug message packet to see why it is 27 character string. Here is how the Debug Message packet looks when sent over the USB Serial:
If you look at the header file you’ll see the following sizes:
That’s 34 bytes plus 27 byte string. Which is 61 bytes. Where is the other 3? Let’s look at the packet that was sniffed from the line:
There are 3 extra bytes at the end: 0x0D, 0x0A and 0x00. They are \n, \r and \0.
So that’s how the mystery of 27 byte string was solved.
We know that this hick up or hang happens when the data sent by the client is aligned to exactly max packet size. But why in some cases do we see a hick up and in the others a hang? And why would we still be able to type shell commands? That didn’t make any sense.
Why would we still be able to type shell commands? The answer lies in the KITL Architecture. KITL has a full duplex architecture meaning it has separate channels for sending and receiving data. It also has 16 separate clients (see the picture blow). Debug Message is client number 0 and Shell command window is client number 10 then you can bring down client number 0 but client number 10 would still work. The problem we were having was on the Transport level which is below the Client level but since the Transport was still Polling the data underneath it was able to respond to our Shell requests and only the Debug Message Client was stuck.
But why would it sometimes stop just for a split second and other times it would hang? The answer is simply luck! Let’s say the device was trying to send 4 packets which were exactly 64 bytes each. The host would receive all 4 packets however on the last packet it would just stop because it is waiting for a transmission end packet which never comes because there is no more data in the queue on the device. That is the case when the Debug Message output would get stuck. If the device were to send something else at that moment then the data flow would resume and packet number 4 would be flashed out.
There are other ways that would allow the data flow to be resumed. When the host waits for the packet to come but it never does for any reason then the host would send a packet back asking the device to retransmit the data. However if the only thing the device does is retransmit the last packet it will still be aligned at 64 bytes and the host would still think that there is more data. In most cases though the device send something else with the retransmit (echo, acknowledgment etc.) which will generate a partial packet in the end and the data flow will be resumed. I hope now you see why I said “luck” in the beginning.
In the end this was a very challenging issue and I’m glad that we had USB sniffer that allowed us to look at the raw packets. Some of you might not have the USB sniffers that’s why we thought it is important for us to share our finding with you.