When is it appropriate to use the current processor number as an optimization hint?

Some time ago, on the topic of avoiding heap contention, I left an exercise that asked whether it would be appropriate to use the current processor number (as reported by Get­Current­Processor­Number) to select which heap to use.

In this case, the answer is no. While using the current processor would avoid contention at allocation, it would make contention at deallocation even worse.

Suppose thread 1 is running on processor 1 and allocates some memory. It allocates it from heap 1. Later, thread 2 is running on processor 1 and allocates some memory. It also allocates it from heap 1.

Time passes, and the two threads are now running simultaneously, say one on processor 1 and another on processor 2. They both go to free the memory, but since you have to free the memory back to the heap from which it was allocated, the fact that they are running on separate processors right now is immaterial. They both have to free the memory back to heap 1, and that creates contention.

If we had assigned heaps based on thread, then they would have allocated from different heaps and freed back to those different heaps. No contention. (Assuming the threads were assigned different heaps.)

Okay, so what guidelines can we infer from this analysis?

If you are going to use the current processor as a hint to avoid contention, the entire scenario needs to be quick. If the processor changes while your scenario is running, then you will have contention if the new thread also tries to perform that same processor-keyed operation.

In the case of memory allocation, the memory is allocated, and then used for a while, possibly a long time, before finally being freed back to the heap from which it was allocated. Since the scenario is a very long one, using the current processor number as a hint is going to run into a lot of cases of accidental contention.

On the other hand, if you had a linked list of available memory blocks, then using the current processor may be helpful. Keep a free list per processor. When it's time to allocate a node, you consult the free list for the current processor. And when you want to free a node, you free it back to the list associated with the processor doing the free.

Unlinking a node from a linked list and pushing a node to the front of a linked list are relatively fast operations, so the processor is unlikely to change out from under you. The important distinction here is that we don't try to free the node to the list it originally came from. We return it to the list that belongs to the current processor at the time it is freed.

Of course, if you find that the free list is empty, then you'll have to go create some new nodes. Yes, this introduces the risk of contention, but creating new nodes will be a comparatively slow operation, so the hope is that added risk of contention is not noticeable in practice.

Bonus chatter: In the discussion that followed the exercise, there were a pair apparently contradictory claims:

  • The scheduler tries to keep load even across all processors.
  • The scheduler tries not to move threads between processors.

Both are right.

When a thread is ready to be scheduled, the scheduler will try to put it back on the processor that it had been running on most recently. But if that processor is not available, then the scheduler will move it to another processor.

In other words, the "try not to move threads between processors" rule is a final tie-breaker if the scheduler has multiple processors available to it. But if the thread becomes ready, and there is only one available processor, then the thread will run on that processor. If the system has decided to shut down a processor to conserve power, then the thread will go to a processor that still has power.

Comments (5)
  1. JAS says:

    When the CPU freeing the blocks experiences a glut, how does this get resolved? I can only imagine some kind of global lock-less queue where you post all blocks over N, and where the empty CPUs will go if they run out.

    1. Jon says:

      The global queue doesn’t need to be lockless. Since you’re not going to be using the global queue often, and since you’re less likely to have multiple CPUs accessing it at the same time, performance doesn’t matter so much. You can have rules like:

      * On free, if there are 2N blocks in the CPU-local queue, take a global lock and move N blocks from the CPU-local queue to the global queue, leaving N blocks in the CPU-local queue.

      * When you want to allocate a block, if the CPU-local queue is empty, take a lock, take one block from the global queue, and also move N blocks from the global queue to the CPU-local queue (or until the global queue is empty, if it had less than N blocks in it). If the global queue was empty, allocate 2N+1 blocks using malloc() and put N of them in the CPU-local queue, N of them in the global queue, and return 1.

      With this design, each CPU will only access the global queue when there have been N more allocations than frees, or N more frees than allocations. The maximum amount of memory used is (peak_actual_usage + 2*N*number_of_cpus) blocks (or less).

  2. Joshua says:

    I vaguely recall seeing a heapfree-like function that could free a block on to the wrong heap somehow. I shudder to think what heapdestroy would do to that.

  3. Dennis says:

    I’m not sure if I buy this argument. It seems like using the processor id will reduce contention on allocation and won’t do *worse* than random assignment for free. (I can’t think of a way to assign threads to heaps guaranteed to do better than random.) If the workload does a lot more allocating than freeing, (perhaps it allocates incrementally and frees it all in the end) then processor id might pull ahead. (It might end up allocating memory from many heaps, though.)

    There are problems using the processor id if the processes are loaded funny, or if the CPUs are asymmetric. On the other hand, maybe using the processor id helps in the NUMA case.

    Using per CPU lists is obviously better since you can go lock-free, but if you are going to take a lock, using the processor id isn’t *that* bad.

    1. You can assign a heap to each thread. Then there is no contention because the thread will free the memory back to its dedicated heap.

Comments are closed.

Skip to main content