NTDebugging Puzzler 0x00000007: Interlocked functions

 

Today, we will have some fun with interlocked functions.

 

The following section of code is reentrant. A “well meaning” developer used interlocked functions to avoid serializing on a global table lock.

 

Initial smoke testing shows that the code works fine. Sometimes things are not as they appear when doing initial code review. After several hours of heavy stress testing, the developer finds the machine has bugchecked. Analysis of the dump showed that the caller of this function had steamrolled through nonpaged pool writing stacks on top of everyone’s pool memory.

 

The goal of today’s puzzler is to find the bug and describe how it could be fixed with minimal timing impact.

 

Here are a few details before you begin.

 

1. The variable gIndex is an unsigned long global.

2. The gLogArray memory was allocated from nonpaged pool and the size of this allocation is correct.

 

Ready?

 

00:   PLOG_ENTRY GetNextLogEntry ()

      {

01:         ULONG IterationCount = MAX_RECORDS;

02:         PLOG_ENTRY pEntry;

03:         do

            {

04:               if (InterlockedCompareExchange(&gIndex, 0, MAX_RECORDS) == MAX_RECORDS)

05:                     pEntry = &gLogArray[0];

06:               else

07:                     pEntry = &gLogArray[InterlockedIncrement(&gIndex)];

08:               --IterationCount;

09:         } while(InterlockedCompareExchange(&pEntry->Active, 1, 0) != 0 && (IterationCount > 0));

0a:         return (IterationCount > 0) ? pEntry : NULL;

      }

 

 

 

Happy hunting,

 

Dennis Middleton “The NTFS Doctor”

 


[Update: our answer. Posted 6/10/2008]

Thanks for all the great posts! I saw many interesting answers, and a few unique solutions that I hadn’t considered.

Bug Description

A slight time gap exists between the InterlockedCompareExchange and the InterlockedIncrement. For this reason, several threads could pass the check for MAX_RECORDS prior to doing the increment.

Assume that N is the number of threads that pass the check for MAX_RECORDS while gIndex is at a particular value.

If N or more threads are able to pass the check while gIndex is equal to MAX_RECORDS-(N-1) , then gIndex would be incremented beyond MAX_RECORDS.

For example, let’s assume that 3 threads passed the check while gIndex was at MAX_RECORDS-2. Then after the three increments occur, gIndex would be equal to MAX_RECORDS+1. From that point, invalid pointers would be passed out to the caller.

Possible Solutions

There are several ways to solve this problem. Some are more efficient than others. I would avoid doing checks for MAX_RECORDS-1, MAX_RECORDS, or MAX_RECORDS+1 (interlocked or not) since there could potentially be more than two threads involved in the race condition. Such a solution would only reduce the likelihood of an overflow.

There were a few posts suggesting a lock or critical section for the section of code between the compare and increment. That would be one solution, but it would also do away with the benefits of using interlocked functions.

In keeping with the philosophy of keeping the code fast and simple, here’s a solution that gives a very good result with minimal impact.

1. I removed the if () {} else {} and simply allowed gIndex to increment unchecked. With this change, gIndex can approach its max unsigned long value and possibly wrap - we need to keep the array index in check.

2. The modulus operator (added to line 4 below) will divide the incremented gIndex by MAX_RECORDS and use the remainder as the array index. When dividing, the resultant remainder is always smaller than the divisor (MAX_RECORDS). For this reason, the array index is guaranteed to be smaller than MAX_RECORDS. As even multiples of MAX_RECORDS are reached, the array index resets back to zero mathemagically and no interlocked compare is even necessary.

00: PLOG_ENTRY GetNextLogEntry ()

{

01: ULONG IterationCount = MAX_RECORDS;

02: PLOG_ENTRY pEntry;

03: do

{

04: pEntry = &gLogArray[InterlockedIncrement(&gIndex) % MAX_RECORDS];

05: --IterationCount;

06: } while(InterlockedCompareExchange(&pEntry->Active, 1, 0) != 0 && (IterationCount > 0));

07: return (IterationCount > 0) ? pEntry : NULL;

}

With the fix in place, the code is smaller, faster, easier to read, and most of all - the bug is gone. When developing code, always try to think small.

Best Regards,

NTFS Doctor