Some years ago I was dealing with an issue that was driving me crazy and it took many hours to figure out what was wrong with the code. It was until I was explaining my findings to a colleague (don't you solve lots of things this way?) that it hit me, and when looking at the original (NAS) code I confirmed what was wrong.
What was more irritating was that the system was working most of the time, and, of course, there was no way to consistently reproduce the issue.
I will explain the issue here, hoping that people don't make the same mistake again.
Think about a single instance code unit which is receiving an event in order to process the incoming data (this was using the ComCom objects), and, in order to catch possible errors in the AL execution, a NAV Timer (Navision Timer) is used to reply successfully or with an error to the receiving event initiator (the COM component that started the conversation).
Normal flow (from the NAS perspective) was:
- ComCom2::MessageReceived would be invoked.
- GlobalVal, a global variable in the single instance Codeunit would be filled with the incoming data to be processed.
- NTimer would be enabled, in order to fire relatively soon (I think it was something like 20 milliseconds).
- NTimer::Timer would fire and would get the value from GlobalVal to process the data and reply to the connection (by NAV architecture, this connection had to be a new one, as, when the MessageReceived event finishes, the ComCom connection drops).
- If any AL error would arise within the NTimer event, the NTimer::TimerError would be triggered, which would allow us to reply an error to the calling component (using the GlobalVal to identify which connection would have to be informed about the problem).
Sounds reasonable, and in most cases this would work great. However, people were experiencing mixed up answers especially when errors were to be reported.
Of course, my first thought was that the component connecting to the NAS was the guilty one, as, after all, it is a multi-thread, multi-connection component (it runs as a Windows Service), and getting all this threads straight is always a tricky business. However, after inspecting the components code, and correcting some issues, the problem was still there.
I decided it was time to get the hands really dirty and started making my own component which would simulate the system and start hammering the NAS with requests using multiple threads and inspecting the replies, and sure enough, especially when the NAS would error, the replies would contain a different ID than the original sending ID on the same virtual connection.
So, there must be something in the NAS that does not work correctly. More code inspection, but nothing showed up until.... Hey, wait a second... We are using a global variable to save data and then reusing this data later on when the NTimer event fires, and, since the NAS is single threaded, this should all work fine, right?
Well, of course the first thing that came into my mind was that there was a possibility that a new incoming event would be "queued" before the NTimer event would be scheduled, and this would mean that a new MessageReceived event would be triggered before the NTimer could even process the previous message. Bingo! That is the issue!
After further investigating the NAS code, it turns out that, although the previous scenario is possible, it actually does not queue sequentially. Incoming events are basically random when the host COM cannot handle the call. I could go very deeply here, but what happens is that, when the NAS COM event is called, the NAS actually checks to see if it is already handling another incoming event. If this is the case, it signals that it cannot process the incoming event (via a RPC_E_SERVERCALL_RETRYLATER) and COM basically waits "some time" before it retries the call. The problem is that this "some time" is not known and it changes, and that is when the trouble really starts.
Basically, what happened in our scenario was that 2 consecutive MessageReceived events will fire (before the first NTimer would) and obviously one GlobalVal would be lost and we would reply twice to a single event (the NTimers would fire, but with the same GlobalVal).
How to solve this
The cure was simple. When the MessageReceived event is fired, we process the information and reply to the event on the same connection. After all, ComCom objects do report back AL errors in XML format, so, in principle, there is no need to complicate matters.
So, after getting rid of the NTimer, and replying to the incoming request within the MessageReceived event, everything works as expected.
Jorge Alberto Torres (jtorres)
Software Developer Engineer