Unsigned considered harmful


(or is “xxx considered harmful” completely worn out as a meme?) 

 


I believe that, in general, people should avoid unsigned variables, even when dealing with quantities that should never be negative. I have three major problems with unsigned variables:


 


Subtraction doesn’t always make sense


Unsigned numbers are used to model non-negative integers, and non-negative integers aren’t closed under subtraction. While subtracting two integers always gives another integer, subtracting two non-negative integers doesn’t always give a non-negative integer. Integer subtraction is much more useful than non-negative integer subtraction!


 


Of course, with wraparound the subtraction still gives an unsigned integer, but the result doesn’t make arithmetic sense. Signed integers have the same problem, but only at the extremes where wraparound occurs. Even with idealized infinite-bit-length numbers, subtraction of unsigned integers is problematic.
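As a rough sketch of the wraparound described above (the helper names diff_unsigned and diff_signed are made up for illustration), the same subtraction gives very different answers depending on the type:

```c
#include <limits.h>

/* Hypothetical helper: unsigned subtraction wraps modulo UINT_MAX + 1,
   so diff_unsigned(10, 20) yields UINT_MAX - 9 rather than -10. */
unsigned int diff_unsigned(unsigned int a, unsigned int b) {
    return a - b;
}

/* Hypothetical helper: the same subtraction with signed operands
   gives the arithmetic answer for small values like these. */
int diff_signed(int a, int b) {
    return a - b;
}
```

Calling diff_unsigned(10, 20) produces a huge positive number, while diff_signed(10, 20) produces -10.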


 


Lack of Error-checking on underflow


When a subtraction does underflow, most systems won’t raise a runtime exception (which would be useful); instead they silently wrap around (which is not). In the end, unsigned variables simply alias invalid negative values to large positive ones.
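Since the language won’t do the check, it has to be written by hand. A minimal sketch, assuming a hypothetical helper called checked_sub_uint that reports underflow through its return value:

```c
#include <stdbool.h>

/* Hypothetical helper: computes a - b without wrapping.
   Returns true and stores the result in *out when a >= b;
   returns false and leaves *out untouched when the subtraction
   would underflow. */
bool checked_sub_uint(unsigned int a, unsigned int b, unsigned int *out) {
    if (b > a) {
        return false;
    }
    *out = a - b;
    return true;
}
```

Every caller has to remember to use the helper and to test its result, which is exactly the kind of discipline that tends to get skipped.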


 


Lack of Invalid Values


It is extremely useful when a type has support for a clearly invalid/unused value. This can be used both for error-checking and loop termination. Take a loop that iterates through a string backwards:


 


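#include <string.h>

// Returns the index of the last comma in sz, or -1 if there is none.
int IchLastCommaInString(const char * const sz) {
    int cch = (int)strlen(sz);
    for (int i = cch - 1; i >= 0; --i) {
        if (',' == sz[i]) {
            return i;
        }
    }
    return -1;
}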


 


If we change the code to use an unsigned variable (“Array indexes can never be negative!”) then we lose the ability to easily detect the termination condition:


 


int IchLastCommaInString(const char * const sz) {
    int cch = strlen(sz);
    for (unsigned int i = cch - 1; i >= 0; --i) {  // BAD! unsigned ints are always >= 0
        if (',' == sz[i]) {
            return i;
        }
    }
    return -1;
}


 


To use an unsigned variable we often end up on the dark and dismal path of “if(x > BIGNUM)” where BIGNUM is chosen to be a value so large that we ‘know’ that x is really a negative number in disguise.
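A sketch of what that path tends to look like. The threshold BIGNUM here is an assumed value (half of UINT_MAX) and looks_negative is a made-up name:

```c
#include <limits.h>
#include <stdbool.h>

/* Assumed threshold: anything above half the unsigned range is
   treated as "really a negative number in disguise". */
#define BIGNUM (UINT_MAX / 2u)

/* Hypothetical helper: guesses whether x is the result of an
   unsigned underflow. It cannot tell a wrapped negative value
   from a legitimately large one. */
bool looks_negative(unsigned int x) {
    return x > BIGNUM;
}
```

The check flags 0u - 1u as expected, but it flags a genuinely large value such as UINT_MAX - 5 too, so the heuristic only works if we ‘know’ that large values never occur.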


 


Just use signed variables.


Comments (5)

  1. Keith Farmer says:

    Or just lose the FUD, and perform the equally-easy termination check:

       for(unsigned int i = cch; i > 0; --i) {
           if(',' == sz[i-1]) {
               return i;
           }
       }

  2. The code above is another example of what I dislike about unsigned integers. In an attempt to avoid the underflow problem while still using an unsigned integer, the code was made less clear and a bug was introduced (the index that is returned is off by one)!

    Even when the code is right, unsigned integers can create a maintenance booby trap. It isn’t clear why everything in the loop has to be offset by one, so a future change might use the (natural looking) ‘i >= 0’ idiom, which will crash when a comma isn’t present.

  3. Stephen Cleary says:

    I’m going to have to respectfully disagree with you on this one. Unsigned integers should be used instead of signed integers if the concept is only-positive (especially for managed code) for the following reasons.

    Using signed integers for an unsigned concept has the following results:

    1) Confusion of API, leading to unnecessary/duplicate error checks. e.g., a function returns ‘int’ for the file size. "Will it return -1 if the file doesn’t exist? I *think* it would throw an exception in that case, but what if it returns a negative number? That would cause my program to behave badly. I’d better check it just in case as well as catching the exception. But I wonder what a negative number would mean? What type of exception would I throw? I’ll just throw an internal error exception." The snap answer "just check the docs" simply doesn’t work when 3rd party tools have poor documentation – or even good documentation but neglect to indicate the return value is always positive.

    2) Masks underflow errors. With signed integer math, you just get a (meaningless) negative value. Unsigned subtraction (with underflow checking) catches these bugs at the point where they occur instead of propagating through the rest of the logic.

    3) Invites misuse of negative values with special meanings. Special values should not be defined within a normal datatype range. Firstly, there’s confusion over what the value means: does -1 mean "missing value" or "invalid value detected" or "not applicable"? (Yes, I’ve seen systems where different negative values have different *types* of special meanings – what a mess!) Secondly, *all* code has to deal with these special values – variables are checked time and again for them – to prevent propagation of meaningless values and incorrect math.

    (If you absolutely must have a single "invalid" value, then use a nullable type with careful documentation on what a null value actually means. If you need different types of special values, then use an enum+data or something similar.)

    4) Prevents the usage of the full scope of values. Storing unsigned concepts in a signed variable halves the maximum value available. Historically, this has always caused problems later – the most recent example of this is the 2GB limit (because when those programs were written, why, no one could *ever* have that much memory!).

    In conclusion, I believe the benefits of saying what you mean (unsigned variables for unsigned concepts) outweigh the benefits of sloppy math and built-in invalid values.

         -Steve

    P.S. Your second argument doesn’t hold in managed languages when checking is on; and even when it does (unchecked, or C++), there’s a corresponding counter-argument that overflow isn’t checked on signed integer math.

  4. Keith Farmer says:

    laurion:  your "booby trap" complaint about the inclusion of -1 fails just as easily.  You’re putting a -1 either in the loop initializer, or at the point of use.  If you look closely, all I did was transfer 2 characters from your code, and remove another.

    Take your pick:  you’re doing it either way.

    The true solution is to create a pre-packaged iterator over the array, or (alternatively) create a variable to hold the index value rather than create some assumption about the relationship between iteration value and index, rather than complain about a data type that doesn’t actually make the situation worse (and which, as pointed out by Stephen, can improve matters).

    But as you allude to, this is all what we might expect of a post titled "<foo> considered harmful".

  5. Stephen’s point about underflow errors is a good one. The problem is that in C/C++ there isn’t underflow checking and in C# checking is off by default (for performance reasons).

    Overflow is a lot harder to get to than underflow, so it worried me less. Unless the memory situation is desperate, use a 64-bit number instead of squeezing out that last bit from a 32-bit number. That makes it a lot easier to get the math right.

    The point about APIs is true, but using unsigned for filesizes creates a new problem — what is the difference in size between two files? Take code that looks like this:

    fileInfo1.Length - fileInfo2.Length

    Although file sizes are always >= 0, the difference in size between two files can be negative. In a checked world this code will generate an exception, and in an unchecked world (the default) it will produce a huge number. Take this C# code:

    uint size1 = 10;
    uint size2 = 20;
    Console.WriteLine("The difference is {0}", size1 - size2);

    At any rate, there are a lot of arguments for unsigned as well as against. The point about the iterators is a good one too (we should be raising our level of abstraction).

    It is possible for a smart, experienced programmer to get unsigned code right. In my experience though, using unsigned numbers does more harm than good and I now prefer to use the next largest signed number (e.g. int64) instead of an unsigned one.
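As a small C sketch of the "next largest signed number" approach (the helper name size_difference is hypothetical, and it assumes both sizes fit in 32 bits), widening to a signed 64-bit type before subtracting keeps a negative difference negative:

```c
#include <stdint.h>

/* Hypothetical helper: widen two 32-bit unsigned sizes to int64_t
   before subtracting, so the result is the arithmetic difference
   instead of a wrapped unsigned value. */
int64_t size_difference(uint32_t size1, uint32_t size2) {
    return (int64_t)size1 - (int64_t)size2;
}
```

With sizes of 10 and 20, size_difference returns -10, where the unsigned subtraction above would wrap around to a huge number.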