Why does a single integer assignment statement consume all of my CPU?


Here's a C++ class inspired by actual events. (Yes, the certificate on that Web site is broken.) It is somebody's attempt to create a generic value type, similar to VARIANT.

class Value
{
public:
 Value(Type type) : m_type(V_UNDEFINED) { }

 Type GetType() const { return m_type; }
 void SetType(Type type) { m_type = type; }

 int32_t GetInt32() const
 {
  assert(GetType() == V_INT32);
  return *reinterpret_cast<const int32_t *>(m_data);
 }

 void SetInt32(int32_t value)
 {
  assert(GetType() == V_INT32);
  *reinterpret_cast<int32_t *>(m_data) = value;
 }

 // GetChar, SetChar, GetInt64, SetInt64, etc.

private:
 char m_data[sizeof(int64_t)];
 char m_type;
};

...

Value CalculateTheValue()
{
 int32_t total;
 // ... a bunch of computation ...

 Value result;
 result.SetType(V_INT32);
 result.SetInt32(total);
 return result;
}

Profiling showed that over 80% of the time spent by Calculate­The­Value was inside the Set­Int32 method call, in particular on the line

  *reinterpret_cast<int32_t *>(m_data) = value;

Why does it take so much time to store an integer to memory, dwarfing the actual computation to calculate that integer?

Alignment.

Observe that the underlying data for the Value class is declared as a bunch of chars. Since a char is just a byte, it has no alignment restrictions. On the other hand, data types like int32_t typically do have alignment restrictions. For example, accessing a 32-bit value is usually more efficient if the value is stored in memory starting at a multiple of 4.

How much more efficient depends on the processor and the data type.

Of the processors that allow unaligned memory access, the penalty can be zero, or only 10% or maybe 100%.

Many processor architectures are less forgiving of misaligned data access and raise an alignment exception if you break the rules. When such an exception occurs, the operating system might choose to terminate the application. Or the operating system may choose to emulate the instruction and fix up the misaligned access. The program runs much slower, but at least it still runs. (In Windows, the decision how to respond to the alignment exception depends on whether the process asked for alignment faults to be forgiven. See SEM_NO­ALIGNMENT­FAULT­EXCEPT.)

It appears that the original program is in the last case: An alignment exception occurred, and the operating system handled it by manually reading the four bytes from m_data[0] through m_data[4] and assembling them into a 32-bit value, then resuming execution of the original program.

Dispatching the exception, parsing out the faulting instruction, emulating it, then resuming execution. That is all very slow. Probably several thousand instruction cycles. This can easily dwarf the actual computation performed by Calculate­The­Value.

Okay, but why is the result variable unaligned?

Since, as we noted a while back, the way the Value class is defined requires only byte alignment, the compiler is not constrained to align it in any particular way. If there were a int16_t local variable in the Calculate­The­Value function, the compiler might choose to arrange its stack frame like this:

  • Start at an aligned address X.
  • Put int32_t total at X+0 through X+3.
  • Put int16_t whatever at X+4 through X+5.
  • Put Value result at X+6 through X+22.

Since X is a multiple of 4, X+6 is not a multiple of 4, so the m_data member is misaligned and incurs an alignment fault at every access.

What's more, since the Value class has an odd number of total bytes, if you create an array of Values, you are guaranteed that three quarters of the elements will be misaligned.

The solution is to fix the declaration of the Value class so that the alignment requirements are made visible to the compiler. Instead of jamming all the data into a byte blob, use a discriminated union. That is, after all, what you are trying to emulate in the first place.

class Value
{
public:
 Value(Type type) : m_type(V_UNDEFINED) { }

 Type GetType() const { return m_type; }
 void SetType(Type type) { m_type = type; }

 int32_t GetInt32() const
 {
  assert(GetType() == V_INT32);
  return m_data.m_int32;
 }

 void SetInt32(int32_t value)
 {
  assert(GetType() == V_INT32);
  m_data.m_int32 = value;
 }

 // GetChar, SetChar, GetInt64, SetInt64, etc.

private:
 union
 {
  char    m_char;
  int32_t m_int32;
  int64_t m_int64;
  // etc.
 } m_data;
 char m_type;
};

Exercise: One guess as to the cause of the problem is that the assignment statement is incurring paging. Explain why this is almost certainly not the reason.

Bonus chatter: I'm ignoring RVO here. If you are smart enough to understand RVO, you should also be smart enough to see that RVO does not affect the underlying analysis. It just shifts the address calculation to the caller.

Comments (21)
  1. Anonymous says:

    A page fault will not occur because The call to SetType() writes to a member variable. This pages in the object.

  2. Damien says:

    > the four bytes from m_data[0] through m_data[4]

    I make that five.

  3. Karellen says:

    "What's more, since the Value class has an odd number of total bytes, if you create an array of Values, you are guaranteed that three quarters of the elements will be misaligned."

    Won't all Values be equally aligned? I thought struct/class values were padded/aligned to at least 4 byte boundaries? Doesn't that result in the following

     struct s {

       float f;

       char c;

     };

     struct s a[4];

    causing misaligned float access on a[1].f - a[3].f?

  4. Someone says:

    @Karellen: A struct has the alignment of the member with the highest alignment requirement. In your example, the float member pushes the alignment for the entire struct to a multiple of sizeof(float) == 4. If the struct contains a double, the struct will get sizeof(double) == 8 alignment, etc.

  5. Andre says:

    Besides what Anonymous noted (SetType already pages in the Value):

    For a page fault to occur, Value needs to be paged out. Assuming the system isn't trashing from external load (that would cause everything to suffer about uniformly, not just the SetInt32 call), that means some other part of CalculateTheValue causes a page fault that discards Value's page.

    Thus, if I'm not mistaken, at most 50% can be spend on a page fault for SetInt32.

  6. IanBoyd says:

    > In Windows, the decision how to respond to the alignment exception depends on whether the process asked for alignment faults to be forgiven.

    I wanted to see if my application was experiencing any alignment faults (in the same way i would use Application Verifier to check for other issues, i wanted to test disabling Window's built-in fixups when it catches the `EXCEPTION_DATATYPE_MISALIGNMENT` exception). So the question became how do i opt-out of Window's fixups of alignment faults.

    I consulted the MSDN page "Windows Data Alignment on IPF, x86, and x64".

    It seems that there is no way to opt-out of Windows performing alignment fixups for 32-bit processes:

    > On the x86 architecture, the operating system does not make the alignment fault visible to the application. On these two platforms, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.

    For 64-bit processes, alignment errors are fixed up by the hardware, but MSDN mentions a way to change it:

    > On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)

    Bit 18 of RFLAGS (Alignment Check) controls throwing of exceptions. Using PUSHFQ/POPFQ (the 64-bit RFLAGS version of PUSHFD/POPFD for EFLAGS) allows changing the bits, but it doesn't seem to have any effect.

    The documentation of SetErrorMode and SEM_NOALIGNMENTFAULTEXCEPT say that the option is always on for x86/x64, and only has an effect for Itanium.

    So I asked the question on Stackoverflow "How to enable alignment exceptions for my process on x64?" (stackoverflow.com/.../26919269).

    The general consensus is that there is no way for a user-mode x86 or x64 process to be able to experience EXCEPTION_DATATYPE_MISALIGNMENT exceptions.

    Bonus Reading. Larry Osterman once said:

    > We actually built a version of NT with alignment exceptions turned on for x86 (you can do that as Skywing mentioned). We quickly turned it off, because of the number of apps that broke :)

  7. ChrisR says:

    @IanBoyd: Fascinating, thanks for posting that!  I suppose you already know this as well, but for anyone else who doesn't: You can at least view the total number of alignment faults the system is fixing up by viewing the SystemAlignment Fixups/sec performance counter in perfmon.

  8. Antonio &#39;Grijan&#39; says:

    @Anonymous: true. Furthermore, the stack has been just touched a few bytes away when setting the stack frame, and in the object itself when getting the vtable pointer. All three accesses are a few bytes away from each other, so they probably are in the same page. And it's extremely unlikely for a page with three recent accesses to be swapped out.

  9. Mordachai says:

    Agreed - knowing when it was happening would be a very useful bit of information!!!

  10. Jim says:

    Damien: Then you are not counting like a C programmer (and certainly not like a C++ programmer).

  11. Jim says:

    Damien: I suppose I ought to be helpful rather than just sarcastic, sorry. In case you're not aware, ranges in C and especially C++ are usually specified as a pointer to the first element and a pointer to one past the final element. This works out surprisingly better than pointers to the first and last elements for a number of reasons, especially in how confusing representing a zero-length range would be. For this reason, when you declare an array in C (and C++) the standard specifically guarantees that you're allowed to set a pointer to one past the last element (but not dereference it).

  12. Karellen says:

    @Someone: Ah, that makes sense. Thanks.

  13. SimonRev says:

    My first task after graduating from college was using C++ to write a compiler for a custom programming language.  My code was completely littered with these types of casts (except I used C-Style instead of reinterpret_cast* with nary a care for alignment.  Fortunately the target architecture was a MIPS platform that didn't like unaligned access, so I suspect in most cases I didn't have a major performance issue.

    *After supporting that compiler for 10 years, I deeply regretted using C-Style casts due to several cases where I accidentally cast away a const and then trying to deal with "impossible" changes to variables

  14. IInspectable says:

    @Antonio: There is no vtable. The class is not derived, and lack of a virtual d'tor inhibits that it will (should) be derived from. The type is always known at compiletime, and no vtable indirection is required or generated by the compiler. However, as you note, since the stack has just been touched, and all objects used in the assignment are on the stack. This does make it extremely unlikely for this part of the stack to get paged out.

    [Indeed, if there had been a vtable, then the alignment problem would not have existed, because the vtable itself must be aligned. -Raymond]
  15. Jonas 'Sortie' Termansen says:

    Dereferencing the misaligned pointer the original code is undefined behavior.

    [But why does this undefined behavior result in an assignment statement taking all of the CPU? -Raymond]
  16. Axel Rietschin says:

    In addition to be super-costly, misaligned accesses also turn atomic fetch and store operations into non-atomic ones, in particular in the software-based workaround case. This may have serious implications regarding correctness, or at the very last add to the total cost (lock) and introduce a machine-wide bottleneck in a very insidious way... Additionally, I wonder how well InterlockedIncrement et al. would work when force-fed with an unaligned pointer?

    [Check out the first Remark in the documentation. -Raymond]
  17. Axel Rietschin says:

    OK, the [4] indice is a small mistake than does not add or remove anything to the argument.

  18. Drak says:

    @Jim, Damien: Actually, if you take the North American definition of 'through' [up to and including (a particular point in an ordered sequence)], Raymond includes m_data[4], which means he has actually taken 5 bytes.

    Note that Raymond is not *declaring* the array here, he is listing which elements he wants from it.

  19. Chris Chiesa says:

    This is an excellent example of where old-school knowing-how-things-actually-work has important practical application to contemporary HLL programming. Time was when every programmer learned about alignment as part-and-parcel of learning to program or, at worstt, early in his first project.

    On the other hand, I'd expect any compiler worth its salt to default to aligned data,  and unaligned to be the exception--not the other way around, simply because that's the right/sane thing to do, from the sensible-engineering point of view: it would make for fewer errors/crashes, with less programmer effort, overall.  A basic principle of engineering is that buggy systems should "fail safe," and if your description is correct these compilers are implemented backwards in that respect.

    The fact that "the compiler isn't constrained to [do this]" is no excuse; AT BEST that's a bug in the specs/standards themselves, period.

  20. Gabe says:

    Chris Chiesa: I don't see a problem with the compiler. The user is using a char array which gets 1-byte aligned. Are you saying that the compiler should detect that the user is at some point reinterpreting the chars as ints and change the alignment of the struct? What alignment should the compiler choose? Or are you saying that all char arrays should have more coarse alignment?

    I'm curious how you think the specs/standards could be improved in this respect.

  21. IInspectable says:

    @Chris Chiesa: I would agree with your observation, that this is an excellent example where under-the-hood knowledge is applicable to HLL programming. I do not agree with your conclusion, though: Compilers should not make up for programmer mistakes. The solution is given in the article: "[...] Use a discriminated union. That is, after all, what you are trying to emulate in the first place."

Comments are closed.

Skip to main content