Uninitialized floating point variables can be deadly


A colleague of mine related to me this story about uninitialized floating point variables. He had a function that went something like this, simplified for expository purposes. The infoType parameter specifies which piece of information you're requesting, and depending on what you're asking for, one or the other of the output parameters may not contain a meaningful result.

BOOL GetInfo(int infoType, int *intResult, double *dblResult)
{
 int intValue;
 double dblValue;

 switch (infoType) {
 case NUMBER_OF_GLOBS:
  intValue = ...;
  break;

 case AVERAGE_GLOB_SIZE:
  dblValue = ...;
  break;
 ...
 }
 *intResult = intValue;
 *dblResult = dblValue;
 ...
}

After the product shipped, they started getting crash reports. This was in the days before Windows Error Reporting, so all they had to work from was the faulting address, which implicated the line *dblResult = dblValue.

My colleague initially suspected that dblResult was an invalid pointer, but a search of the entire code base ruled out that possibility.

The problem was the use of an uninitialized floating point variable. Unlike integers, not all bit patterns are valid for use as floating point values. There is a category of values known as signaling NaNs, or SNaN for short, which are special "not a number" values. If you ask the processor to, it will keep an eye out for these signaling NaNs and raise an "invalid operand" exception when one is encountered. (This, after all, is the whole reason why it's called a signaling NaN.)

The problem was that, if you are sufficiently unlucky, the leftover value in the memory assigned to dblValue will happen to have a bit pattern corresponding to a SNaN. And then, when the processor tries to copy it to dblResult, the exception is raised.
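
Here is a minimal sketch of the failure mode (the bit pattern and the _controlfp call are mine for illustration; whether the plain assignment actually goes through the x87 FPU depends on the compiler and the architecture, and with SSE2 code generation a straight 64-bit move will not fault):

#include <float.h>   /* _controlfp, _clearfp, _MCW_EM, _EM_INVALID (MSVC CRT) */
#include <string.h>  /* memcpy */
#include <stdio.h>

int main(void)
{
    /* Ask the processor to report invalid-operand exceptions
       (they are masked by default). */
    _clearfp();
    _controlfp(_controlfp(0, 0) & ~_EM_INVALID, _MCW_EM);

    /* One of the many bit patterns that decode as a signaling NaN:
       exponent all ones, top mantissa bit clear, mantissa nonzero. */
    unsigned long long bits = 0x7FF0000000000001ULL;
    double dblValue;
    memcpy(&dblValue, &bits, sizeof dblValue);

    /* If the compiler copies this through the x87 FPU (fld/fstp),
       loading the SNaN raises the invalid-operand exception here. */
    double dblResult = dblValue;

    printf("%g\n", dblResult);
    return 0;
}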

There's another puzzle lurking behind this one: Why wasn't this problem caught in internal testing? We'll learn about that next time.

Comments (49)
  1. Nik says:

    Stunning. Looking forward to the next one.

  2. Tom says:

    Just a guess, but perhaps the internal testing was done against debug builds?  Debug builds fill unused memory with 0xCC, which is not a signaling NaN.

  3. David Walker says:

    @Karellen:  "This was in the days before Windows Error Reporting, so all they had to work from was the faulting address, …"

    Maybe this was also in the days before such compiler warnings were common.

  4. Mike Dunn says:

    main()
    {
     union { double d; unsigned char uc[sizeof(double)]; } u;
     cout << "Bytes: ";
     for (int i = 0; i < sizeof(u.uc); i++)
       cout << hex << (unsigned) u.uc[i] << ' ';
     cout << endl << "Class: " << _fpclass(u.d) << endl;
     cout << "Double val: " << u.d << endl;
    }

    prints:

    Bytes: cc cc cc cc cc cc cc cc
    Class: 8
    Double val: -9.25596e+061

    Class 8 is _FPCLASS_NN (negative normal), so a bit pattern of all 0xCC bytes is a legal value for a double. VC6 didn’t warn me about using uninitialized data, although I did get a warning on "double d; cout << d;"

    As for why this wasn’t caught, possibly:

    1. Devs never ran release builds (entirely possible).

    2. QA never ran release builds (I hope this isn’t the case).

    3. Internal machines were old enough that they didn’t support raising an exception upon encountering a SNaN.

    4. Raymond said "If you ask the processor to" watch for SNaNs – internal machines that did support the feature had it turned off.

  5. Mike Dimmick says:

    You can set the floating point exception behaviour by calling _controlfp. The default appears to be that all floating point exceptions are masked. However, a program can change them at any time, which affects the behaviour for the thread that changes them. Therefore something that doesn’t cause a problem in testing in one host may be a problem in another. I seem to recall that VB6 turned floating point exceptions on, or maybe I’m confusing that with the fact that all VB errors were raised with a ‘Floating Point Inexact Result’ exception.
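
    A minimal sketch of saving and restoring that state around a block of code (the helper name is made up; only the exception-mask bits are touched):

    #include <float.h>

    void with_fp_traps(void (*work)(void))
    {
        unsigned int saved = _controlfp(0, 0);      /* read the current control word */
        _clearfp();                                 /* discard any pending exception  */
        _controlfp(saved & ~_EM_INVALID, _MCW_EM);  /* unmask invalid-operand only    */

        work();                                     /* FP code that should fault on SNaNs */

        _controlfp(saved, _MCW_EM);                 /* put the mask back as we found it */
    }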

  6. Joe Butler says:

    Did internal tests always call in this order:

    res = GetInfo(AVERAGE_GLOB_SIZE, &intResult, &dblResult);
    res = GetInfo(NUMBER_OF_GLOBS, &intResult, &dblResult);

    where real world use might end up calling for num_of_globs first?

    I.e., was the local dblValue simply at the same spot on the stack as on the previous call, and therefore still holding valid data?

  7. Anthony Wieser says:

    Mike

    The best bit is when something else that gets loaded into your process decides to call _controlfp on your behalf, like a certain printer driver did, whenever you loaded up the print settings.

    As a result I had to abandon any hope of having control of floating point exceptions, as I couldn’t predict when this would have been called.

    That one’s only beaten by someone I used to work with who thought it was a good idea to call setlocale in DllMain in response to DLL_PROCESS_ATTACH in a COM server.   So your program worked fine until you loaded any object in his library.

  8. Andrew says:

    Anthony,

    I had a project a few years back that suffered from the same problem you had with a third-party DLL resetting the control word when hooking into my app. I contacted Microsoft’s support people and found out that DLLs created by Microsoft compilers prior to VS 2003 will automatically reset the floating point control word when loaded.

    I think this exchange happened shortly after VS 2003 came out, so there were a lot of existing DLLs out there that had this problem. I bet there are still a lot of DLLs out there that have this problem. So it’s a small consolation, but your printer manufacturer was probably not being lazy or malicious.

  9. asdf says:

    "Why wasn’t this problem caught in internal testing?"

    Huh, I know VS used to set uninitialized vars to 0 in debug mode, but I would have thought it would get hit in Release mode…unless they were testing with a Debug build.

  10. Miles Archer says:

    Why wasn’t this caught in testing?

    Perhaps it wasn’t reproducible and was considered a fluke that would never happen in production.

  11. DriverDude says:

    "Why wasn’t this problem caught in internal testing?"

    Maybe they didn’t do enough testing. Assuming their product was even moderately successful, more customers will be using their product for much more time than their testers ever did.

    The last part is important: how much time do testers use an app before restarting it (or the computer), compared to real-world usage where the app could be running for hours or days at a time.

  12. One source of trouble here is the model of having the FP exception mask (and rounding control bits, etc) be a piece of global state that "acts at a distance" on all FP computations. It simply does not scale to applications that are put together from libraries of different origin — i.e. every nontrivial program.

    Yes, that is how the hardware works, but C is enough of a high-level language that this detail should not be exposed to the programmer. Not many programmers consciously set out to write a function that implements a complex floating-point computation using whatever rounding control the caller happens to have set up for its intermediate results.

    In an ideal world, programmers would specify exactly which FP behavior they wanted for which piece of code (using pragmas or whatnot), and the compiler would enforce the FPU state whenever control may reach FP code from an unknown location.

    Unfortunately, in the real world this would probably be too much of a performance liability to sit well with numeric computing people. The i386’s FLDCW instruction does not appear to be designed to be used that heavily. It ought to have had a fast single-byte variant with an immediate argument …
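
    In standard C the closest thing today is <fenv.h>, which at least lets a function pin down and then restore the state it depends on (a sketch; MSVC spells the pragma #pragma fenv_access(on)):

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON      /* we read and write the FP environment */

    double sum_rounded_down(const double *v, int n)
    {
        int saved = fegetround();    /* remember the caller's rounding mode */
        fesetround(FE_DOWNWARD);

        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += v[i];

        fesetround(saved);           /* restore it before control leaves this code */
        return s;
    }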

  13. Yuhong Bao says:

    BTW, this is the same reason why copying memory by treating it as floating point numbers do not work.

  14. SomeUser says:

    Anthony Wieser: I think you got it reversed. A NaN is any floating point number with exponent=0xff. Those are split into two categories:

    (1) non signaling NAN – mantissa=0

    (2) signaling NANs (notice there are many) – mantissa != 0. in fact you pass an error code in the mantissa bits (it was designed like this).

  15. Karellen says:

    Never mind testing, how did it pass check-in? The compiler should have warned against the use of an uninitialised variable. Why didn’t it? If it did, why was the warning not properly investigated?

    [Many people turn off warning C4701 because it generates false positives on code like this:
    void f(int i, int *p)
    {
     int j;
     if (i) j = 1;
     if (i) *p = j;
    }
    

    -Raymond]

  16. Anthony Wieser says:

    I once tried to use SNAN’s on purpose, but couldn’t for the life of me get it to work.  

    The spec seems to be:

    An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception when used in operations. SNaN’s can be handy to assign to uninitialized variables to trap premature usage.

    Given this structure of a float with 1 byte packing,

    // @struct IEEE_FLOATREP | allows bit access to 4 byte floats
    typedef struct ieee_floatrep
    {
        unsigned low_mantissa:16;  // @field low 16 bits of mantissa
        unsigned high_mantissa:7;  // @field high 7 bits of mantissa
        unsigned exponent:8;       // @field exponent of floating point number
        unsigned sign:1;           // @field sign of floating point number
    } IEEE_FLOATREP;

    Shouldn’t this generate an SNAN?

    // @func void | SET_SNAN | sets a number to a Not A Number (SNAN)
    // @parm float * | t | pointer to value to set to SNAN
    #define SET_SNAN(t) (((IEEE_FLOATREP *) t)->high_mantissa = 0x00, \
                         ((IEEE_FLOATREP *) t)->exponent = 0xff)

    I couldn’t for the life of me get it to signal, or show up in the debugger as an SNAN.

  17. steveg says:

    "this is the same reason why copying memory by treating it as floating point numbers [does] not work."

    That’s actually not entirely stupid. On 32 bit usually sizeof(long double)==16, and sizeof(unsigned long)==8, therefore fewer iterations in a simple memcpy implementation. I wonder if anyone’s actually tried and measured it (a very quick google found nothing)?

    As for the teaser question at the end of the post, perhaps they were using a 386 without a 387 (or a 486SX), on the hypothesis there might be a difference in the software vs hardware implementation.

  18. Tom says:

    @steveg:  You might be surprised at how the compiler implements that.  In most cases, the compiler will emit code to push the double into the FPU stack, then pop it off into the target location.  When doing this, you have to deal with NaNs and the possibility that, when the double is extended into a long double (80-bits), it might not come back out the exact same.  Indeed, it’s possible that it may even be slower than just copying DWORDs as some FPU instructions take many more cycles than comparable integer instructions.

    In the end, it’s just better to use native-size unsigned integers to bulk copy data.
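
    In other words, something along these lines (a sketch only; a real memcpy also handles alignment and the odd bytes at either end):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy through integer registers so no byte is ever interpreted as a
       floating point value. Assumes count is a multiple of the word size
       and that both pointers are suitably aligned. */
    void copy_words(void *dst, const void *src, size_t count)
    {
        uintptr_t *d = (uintptr_t *)dst;
        const uintptr_t *s = (const uintptr_t *)src;
        for (size_t n = count / sizeof(uintptr_t); n != 0; n--)
            *d++ = *s++;
    }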

  19. See, if they’d prototyped it like this, no worries:

    union int_or_double {
      int _int;
      double _double;
    };

    BOOL GetInfo(int infoType, int_or_double *intOrDoubleResult)

    … or even better

    BOOL GetInfo(int infoType, void *intOrDoubleResult)

    … or better still

    BOOL GetInfo(void *mystery_input_output_argument);

    On a more serious note, if I was implementing this function I would seriously consider making it part of the contract that the un-asked-for output parameter be NULL, and enforce that at runtime.

  20. Mike Dunn says:

    Heh, how soon they forget. Early FPS engines squeezed out extra speed by copying memory with the FPU.

  21. steveg says:

    @Tom: Yup, I’ve no doubt it’s slower (and stupid), just curious how much slower. The routine would have to call _controlfp appropriately. And copy using pointers to long doubles. And assume it was copying multiples of sizeof(long double).

    Reader exercise: s/void|char/long double/g in memcpy.c in crt/src and see what happens.

  22. Sam says:

    There are some memcpy() implementations that use x86 MMX or SSE instructions, e.g., in the Linux kernel.

    Are SNaNs unique to the x86 architecture? Is it possible to write portable code that uses SNaNs?
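
    (One portable handle on them in C++ is <limits>; whether anything actually traps still depends on the platform and the exception mask. A sketch:)

    #include <limits>
    #include <cstdio>

    int main()
    {
        if (std::numeric_limits<double>::has_signaling_NaN) {
            double snan = std::numeric_limits<double>::signaling_NaN();
            std::printf("%g\n", snan);   // merely printing it won't trap unless unmasked
        } else {
            std::printf("no signaling NaN for double on this platform\n");
        }
        return 0;
    }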

  23. Rivari says:

    Hi;

    what does sizeof(long double) give? size of long + double? I don’t think that u can init a "long double var = 0".

    Thnx for clarifying this for me.

  24. ultrano says:

    @Rivari : that’s an 80-bit float. Full-precision stuff, only slower to load and store from/to memory, but not compute. Dropped in later compilers.

  25. mikeb says:

    > what does sizeof(long double)? size of long + double? I don’t think that u can init a "long double var = 0" <<

    You should probably just try it instead of guessing, you may be surprised…

  26. mikeb says:

    @ultrano:

    > that’s an 80-bit float. Full-precision stuff, only slower to load and store from/to memory, but not compute. Dropped in later compilers <<

    It’s not necessarily an 80-bit float – it simply has to be able to represent at least all double values with no loss of precision (in fact a long double may end up being implemented exactly as a double).  The standard makes no promises that long doubles computations will be as fast as computations with doubles.  

    Also, why would later compilers drop a type that’s in the standard (since C89)?

  27. Yuhong Bao says:

    "(in fact a long double may end up being implemented exactly as a double)"

    In fact, that is exactly what current MS compilers implement it as.

    Back in the 16-bit days, however, long double was supported as an 80-bit floating point number, which cannot be supported by SSE.

  28. Dean Harding says:

    "this is the same reason why copying memory by treating it as floating point numbers do not work."

    Thank god! Now I understand why I shouldn’t copy memory by *treating it as floating point numbers*. I never understood that rule until now.

    /sarcasm

  29. Ray Trent says:

    Let me just say that anyone who turns off the uninitialized variable warning *for whatever reason* is a fool and deserves every last crash they’ve earned.

    The *only* proper way to resolve this warning is:

    void f(int i, int *p)
    {
     int j = 0; // <- instead
     if (i) j = 1;
     if (i) *p = j;
    }

    Of course, 0 is arbitrary, it’s just as good as uninitialized, but feel free to use 0xDEADBEEF if you prefer :-).

  30. Ian says:

    One convenient way to initialise a floating point NaN in C/C++ (without signalling an exception) is as follows:

    static const union { unsigned long l; float f; } NaN = { 0x7f800000 };
    #define NotANumber NaN.f

    (Modify as desired for double types and adjust the bit pattern to taste for signalled/non-signalled/inf/etc. Obviously this is not portable to some 64-bit compilers.)

    This might be perfectly obvious to some people here, but it took me a while to discover.
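
    The double variant would look something like this (0x7FF8000000000000 is a quiet NaN pattern; clear the top mantissa bit and set a low one instead, e.g. 0x7FF0000000000001, if you want the signalling kind):

    static const union { unsigned long long ll; double d; } NaN64 = { 0x7FF8000000000000ULL };
    #define NotANumberDouble NaN64.d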

  31. Dog says:

    Can’t we just generalise this to "Uninitialised variables can be deadly"?

    Seriously. How much effort does it take to add an ‘= 0’, ‘= ""’ or ‘= NULL’ to the end of each variable definition and remove a whole class of difficult to debug, subtle and inconsistent bugs?

    I like the fact that C# considers the use of an uninitialised value an error.

    Can someone tell me if MSVC has such an option (without treating all warnings as errors)?

  32. steveg says:

    Interesting! memcpy implemented as a simple loop is faster using (long double*) than (char*). In debug mode it’s much faster.

    Debug
    -----
    memcpy_longdbl:  1156
    memcpy_char:     7875
    memcpy (stdlib): 1016

    Release (optimised for speed)
    -----------------------------
    memcpy_longdbl:   953
    memcpy_char:     1625
    memcpy (stdlib):  984

    The test looped 1000 times copying from one 200,000 long double heap allocated block to another on an Intel 2.13ghz Core 2 6400. Visual Studio 2003. I used the C implementation in VC’s memcpy.c as the basis for both versions.

    By default the MS stdlib memcpy calls RTLMoveMemory. http://msdn.microsoft.com/en-us/library/ms803004.aspx

    (YMMV, I did this over a sandwich).

  33. Pedant # 27 says:

    @RayTrent belched thus:

    ===

    Let me just say that anyone who turns off the uninitialized variable warning *for whatever reason* is a fool and deserves every last crash they’ve earned.

    The *only* proper way to resolve this warning is:

    void f(int i, int *p) {
     int j = 0; // <- instead
     if (i) j = 1;
     if (i) *p = j;
    }

    Of course, 0 is arbitrary, it’s just as good as uninitialized, but feel free to use 0xDEADBEEF if you prefer :-).

    ===

    Agree with your ‘fool’ comment but not your ‘only way to resolve’ one.

    The code Raymond provided is perfectly valid – there will be no attempt to use j unless it’s first initialized (i controls both actions).

    It’s also fixed by replacing the if statements by a single:

    if (i) { j = 1; *p = j;}

    but there may have been code between the two assignments which Raymond left out for cleanliness (almost certainly since the code segment is pretty useless as-is).

    Cheers.

  34. Larry Lard says:

    Raymond> Many people turn off warning C4701 because it generates false positives

    Was there a time when you couldn’t #pragma disable a compiler warning around a piece of code that generated a false positive for that warning?

  35. doynax says:

    Another interesting point is that IEEE floats aren’t a total order, as NaNs are unordered. A fact which can wreak havoc with searching and sorting algorithms.
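
    For example (a sketch): every ordered comparison involving a NaN is false, so plain operator< is not the strict weak ordering std::sort requires, and you need a comparator that puts the NaNs somewhere definite:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Treat all NaNs as one equivalence class that sorts after every number;
    // this restores a strict weak ordering for std::sort.
    bool less_nan_last(double a, double b)
    {
        if (std::isnan(a)) return false;   // a NaN is never "less"
        if (std::isnan(b)) return true;    // any number beats a NaN
        return a < b;
    }

    int main()
    {
        std::vector<double> v = { 3.0, std::nan(""), 1.0, std::nan(""), 2.0 };
        std::sort(v.begin(), v.end(), less_nan_last);   // 1 2 3 nan nan
        return 0;
    }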

  36. no one in particular says:

    Is it just me, or is it really bad style for the original code to change a variable it was not asked to change?

    NUMBER_OF_GLOBS should only ever change the int variable, just as AVERAGE_GLOB_SIZE should only touch the double result.

    In my hands, even the ASSERTing for valid pointers would be inside the switch clause.

  37. JamesNT says:

    @Raymond,

    I would say the bug in question was not caught in internal testing because it may only manifest after you’ve used the program for a sufficient amount of time.

    JamesNT

  38. Anthony Wieser says:

    Well, eventually I’ve managed to track down what’s going on.

    It appears that for an SNAN to work the value of the mantissa needs the high bit clear, but also the mantissa must be non-zero to be called a signalling NAN.  Unhelpfully, the debugger always shows them as QNAN’s.

    The correct definition is:

    // @func void | SET_SNAN | sets a number to a Not A Number (SNAN)
    // @parm float * | t | pointer to value to set to SNAN
    #define SET_SNAN(t) (((IEEE_FLOATREP *) t)->high_mantissa = 0x01, \
                         ((IEEE_FLOATREP *) t)->exponent = 0xff)

  39. JamesNT says:

    @Dog

    "Seriously. How much effort does it take to add an ‘= 0’, ‘= ""’ or ‘= NULL’ to the end of each variable definition and remove a whole class of difficult to debug, subtle and inconsistent bugs?"

    Apparently, it is considerably difficult. The vast majority of programming I do is on programs I have "inherited" and I can tell you with harsh conviction that I see uninitialized variables all the time. Another major problem I see is a serious lack of error checking. For example, no error checking in the class that handles the connection to the Access database of an application. If the Access database isn’t there or is otherwise unavailable (i.e. think file share), then the user gets some cryptic error message from VB6 or C++ that they will never understand in a million years, instead of a nice dialog box that says "Error: Database unavailable. Please ensure the computer containing the database is available and connected to the network" or some other message meant for human consumption.

    JamesNT

  40. Dave Harris says:

    In C++ you can get similar problems with instance variables, e.g. if there is an assignment operator or copy constructor that copies each member. (I believe I’ve seen this for compiler-generated functions, but it can certainly happen for hand-crafted ones.)

    It’s slightly harder to initialise instance variables. You have to update each constructor, every time you add one. It’s not just "=0" at the point of declaration.

    In most other situations, an uninitialised local variable is a sign that its scope is too big (it should have a defined value whenever it is in scope), which in turn suggests the function is too big and/or incoherent (in the technical sense). GetInfo() is doing several different things at once.

  41. Michiel says:

    There have been proposals to default-initialize ints *unless explicitly requested*.

    I.e. you’d have to write

    int a = __uninitialized;
    switch (b) {
     case 1: a = 7; break;
     case 2: a = 5; break;
     /* … */
    }

    if you did not want the overhead that a plain "int a;" would then carry.

    Old code would of course still compile, at a slight decrease in speed.

  42. "You have to update each constructor, every time you add one."

    Is this true?  Can’t you do

    class CFoo {
     int _1;
     int _2;
     int _n;
    public:
     CFoo() : _1(0), _2(0), … _n(0) {}
     CFoo(double d) : CFoo() { … } // automatically initializes
    };

    Or is this a Java-ism?

  43. Ian says:

    I just know I’m going to be shouted down for this, but…

    The problem with initialising each and every variable when you declare it is that it implies that you haven’t thought properly about how they are used.

    Everyone (I hope) will agree that if a variable is declared and then initialised at its first use, then technically there is no potential error in the code. It just seems to me that the ‘safety’ argument in favour of additionally initialising such variables at declaration is spurious, because real safety comes from understanding the code and *being careful*.

    But apparently care is not seen as being of value any more in today’s health-and-safety culture… :-(

  44. Michiel says:

    @Maurits: It’s (still) a Java-ism.

    The closest current C++ alternative would be

    CFoo (double d) { *this = CFoo(); /*…*/ }

    It does require that CFoo::operator=(CFoo const&) can deal with the uninitialized *this.

  45. mikeb says:

    >> The problem with initialising each and every variable when you declare it is that it implies that you haven’t thought properly about how they are used <<

    Unfortunately, errors happen. Especially when code is modified later. By that argument, assertions would be undesirable too, and we could go all the way to not declaring variables at all and having them spring into existence at first use (note that there are a large number of enthusiasts of dynamic languages who would agree with this – I’m not one of them).

    Ideally, you won’t declare the variable until its first use, so declaration is initialization.

    If you have to declare early, then it harms nothing to give a default initialization – if the initialization is a throwaway, the compiler will optimize it away.

    [The case where the compiler raises a spurious “Possibly uninitialized variable” warning is exactly the case where it won’t optimize it away! -Raymond]
  46. Steve says:

    Some object-oriented guy used to tell me: when you encounter a "switch" statement, something is likely to be wrong nearby.

    Here, instead of doing a switch, we could have made a method for each type. We wouldn’t have run into any of these issues, and testing would have been simpler as well.

    cheers,

  47. SuperKoko says:

    From Ian:

    "The problem with initialising each and every variable when you declare it is that it implies that you haven’t thought properly about how they are used."

    But uninitialized variables induce very unpredictable bugs.

    If you want the best of both worlds, initialize int variables to 0xDEADBEEF, as they’re then very likely to produce bugs that are detected early.

  48. mikeb says:

    > [The case where the compiler raises a spurious "Possibly uninitialized variable" warning is exactly the case where it won’t optimize it away! -Raymond] <<

    In that case, the code should probably be refactored or you should live with the cost of the unnecessary initialization.  Of course there are always exceptions to rules, but I think that in general it’s worthwhile to keep compilers happy.

    For example, in the case where the code is part of a critical loop and you don’t want to pay for the initialization, then clearly the code needs every cycle it can get. The fact that the compiler is issuing the spurious diagnostic is a clue that the compiler may not have determined that the two conditionals are equivalent, so a refactor that eliminates the message may well pay off in further optimizing your code.

  49. Ian says:

    @mikeb: This is what I meant about being careful. I wouldn’t use an assert() to check a condition that is obvious from the code. I do use asserts – I use them mainly for checking conditions where I believe the logic of the code makes that condition necessarily true but I can’t actually ‘see’ it directly.

    But if some code is so complicated that you can’t easily see whether a variable is initialised before it is used, then the problem is more serious than an uninitialised variable.

    I think you are spot on with your later comment "In that case [the compiler can’t tell if a variable is initialised], the code should probably be refactored".
