IEEE-754 Floating point –> Special values and Ranges


As I mentioned in my previous post, we will now look at the range of values for IEEE-754 floating-point numbers, denormalized forms, NaN, and a simple algorithm for converting between IEEE-754 hexadecimal representations and decimal floating-point numbers.


 


Let’s take a look at the special values in IEEE-754 floating point representation.


 


Special values:


 


Exponent field values of all 0s and all 1s are reserved to denote special values in this scheme.



 Zero


Zero is represented with an exponent field of zero and a mantissa field of zero. Depending on the sign bit, it can be a positive zero or a negative zero. Thus, -0 and +0 are distinct values, though they are treated as equal.
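To make the two zeros concrete, here is a minimal sketch that compares them by value and by storage bits. The helper names (`bits_of`, `zeros_equal_but_distinct`) are made up for this illustration, and the code assumes that `float` is a 32-bit IEEE-754 single on your platform.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Return the raw 32-bit storage pattern of a float.
// Assumption: float is a 32-bit IEEE-754 single precision value.
std::uint32_t bits_of(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// True when +0 and -0 compare equal but differ in their sign bit.
bool zeros_equal_but_distinct()
{
    float pz = 0.0f;
    float nz = -0.0f;
    return pz == nz                           // equal as values
        && !std::signbit(pz) && std::signbit(nz)
        && bits_of(pz) == 0x00000000u         // sign 0, exponent 0, mantissa 0
        && bits_of(nz) == 0x80000000u;        // sign 1, exponent 0, mantissa 0
}
```

Calling `zeros_equal_but_distinct()` should return true on a conforming platform: the comparison operator ignores the sign of zero, while `std::signbit` and the raw bits do not.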


 



 Infinity


Infinity is represented with an exponent of all 1s and a mantissa of all 0s. Depending on the sign bit, it can be positive infinity (+∞) or negative infinity (-∞). Infinity is produced when a result overflows the largest representable number, so that the computation can continue.
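A small sketch of that overflow behaviour, assuming the default rounding mode (round to nearest) and no fast-math style compiler options. The function name is hypothetical.

```cpp
#include <cmath>
#include <limits>

// Doubling the largest finite float overflows to +infinity.
// The volatile keeps the compiler from folding the expression away.
float overflow_to_infinity()
{
    volatile float big = std::numeric_limits<float>::max();
    return big * 2.0f;
}
```

The result compares equal to `std::numeric_limits<float>::infinity()`, and further arithmetic on it (for example subtracting 1) still yields infinity, which is what lets a long computation carry on.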


 



 NaN


The value NaN (Not a Number) represents a result that is not a real number. NaNs are produced by computations whose results are undefined, and operations are defined on NaN so that the computation can continue. A NaN is represented by a bit pattern with an exponent of all 1s and a non-zero mantissa. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling NaN).
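Here is a minimal sketch of how a NaN arises and propagates. It assumes IEEE-754 semantics are in effect (no fast-math options); the function names are only for illustration.

```cpp
#include <cmath>

// 0/0 is an undefined result, so it produces a NaN.
// The volatile prevents compile-time evaluation.
float make_nan()
{
    volatile float zero = 0.0f;
    return zero / zero;
}

// Arithmetic on a NaN is defined and yields a NaN again.
bool nan_propagates()
{
    float n = make_nan();
    return std::isnan(n + 1.0f) && std::isnan(n * 0.0f);
}

// A NaN never compares equal to anything, including itself.
bool nan_equals_itself()
{
    float n = make_nan();
    return n == n;
}
```

`nan_propagates()` should return true and `nan_equals_itself()` should return false, which is why `x != x` is a common NaN test.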


 


A QNaN is a NaN with the most significant fraction bit set (denotes indeterminate operations).


 


An SNaN is a NaN with the most significant fraction bit clear (denotes invalid operations).
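The quiet/signalling split can be read straight off the storage bits. The sketch below follows the convention described above (most significant fraction bit set means quiet), which is what current x86 and ARM hardware use; some older architectures used the opposite meaning, so treat this as an assumption about your platform. `NanKind` and `classify_nan_bits` are names invented for this example.

```cpp
#include <cstdint>

enum class NanKind { NotNaN, Quiet, Signalling };

// Classify a single precision bit pattern.
// Assumption: the top fraction bit (bit 22) set means "quiet".
NanKind classify_nan_bits(std::uint32_t bits)
{
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;  // bits 30 - 23
    const std::uint32_t mantissa = bits & 0x007FFFFFu;    // bits 22 - 0
    if (exponent != 0xFFu || mantissa == 0)
        return NanKind::NotNaN;   // finite value or infinity
    return (mantissa & 0x00400000u) ? NanKind::Quiet : NanKind::Signalling;
}
```

For example, `0x7FC00000` classifies as quiet, `0x7F800001` as signalling, and `0x7F800000` (infinity) as not a NaN. Working on the integer pattern avoids loading a signalling NaN into a floating-point register, where some hardware would quiet it.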


 



 Denormalized


If the exponent is all 0s and the mantissa is non-zero, the value is treated as a denormalized number. Denormalized numbers do not have an assumed leading 1 before the binary point. For single precision, this represents the number $(-1)^s \times 0.m \times 2^{-126}$, where s is the sign bit and m is the mantissa. For double precision, it represents $(-1)^s \times 0.m \times 2^{-1022}$.
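As a rough sketch of the single precision formula: since 0.m is the 23-bit mantissa field divided by 2^23, the value 0.m × 2^-126 is the same as m × 2^-149. The function name `denormal_value` is hypothetical, and the code assumes denormals are not being flushed to zero on your platform.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Value of a single precision denormal from its sign and 23-bit mantissa field:
// (-1)^s x 0.m x 2^-126  ==  (-1)^s x m x 2^-149
float denormal_value(bool negative, std::uint32_t mantissa)
{
    // Keep only the 23 mantissa bits; any such integer is exact as a float.
    float m = static_cast<float>(mantissa & 0x007FFFFFu);
    float magnitude = std::ldexp(m, -149);   // exact scaling by a power of two
    return negative ? -magnitude : magnitude;
}

// The smallest mantissa (1) should give the smallest positive denormal.
bool smallest_denormal_matches()
{
    return denormal_value(false, 1u) == std::numeric_limits<float>::denorm_min();
}
```

With a mantissa of 1 this yields the smallest positive denormal, and `std::fpclassify` reports the result as `FP_SUBNORMAL`.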


 


 


Thus, the following values correspond to each representation:


(Note that b used in the table is the bias)


 































































































| Sign (s) | Exponent (e) | Mantissa (m) | Range for single precision values | Range name |
|---|---|---|---|---|
| 1 | 11..11 | 10..00 to 11..11 | __ | QNaN |
| 1 | 11..11 | 00..01 to 01..11 | __ | SNaN |
| 1 | 11..11 | 00..00 | < $-(2-2^{-23}) \times 2^{127}$ | -Infinity (Negative Overflow) |
| 1 | 11..10 to 00..01 | 11..11 to 00..00 | $-(2-2^{-23}) \times 2^{127}$ to $-2^{-126}$ | Negative Normalized, $-1.m \times 2^{(e-b)}$ |
| 1 | 00..00 | 11..11 to 00..01 | $-(1-2^{-23}) \times 2^{-126}$ to $-2^{-149}$ | Negative Denormalized, $-0.m \times 2^{(-b+1)}$ |
| __ | __ | __ | $-2^{-150}$ to < -0 | Negative Underflow |
| 1 | 00..00 | 00..00 | -0 | -0 |
| 0 | 00..00 | 00..00 | +0 | +0 |
| __ | __ | __ | > +0 to $2^{-150}$ | Positive Underflow |
| 0 | 00..00 | 00..01 to 11..11 | $2^{-149}$ to $(1-2^{-23}) \times 2^{-126}$ | Positive Denormalized, $0.m \times 2^{(-b+1)}$ |
| 0 | 00..01 to 11..10 | 00..00 to 11..11 | $2^{-126}$ to $(2-2^{-23}) \times 2^{127}$ | Positive Normalized, $1.m \times 2^{(e-b)}$ |
| 0 | 11..11 | 00..00 | > $(2-2^{-23}) \times 2^{127}$ | +Infinity (Positive Overflow) |
| 0 | 11..11 | 00..01 to 01..11 | __ | SNaN |
| 0 | 11..11 | 10..00 to 11..11 | __ | QNaN |


 


 


Range:


 


As shown in the table above, the range of positive normalized numbers for single precision is $2^{-126}$ to $(2-2^{-23}) \times 2^{127}$. Note that the bias (b) here is 127.


 


Let’s see how we arrive at these ranges. As shown in the table above, a positive normalized number is represented as $1.m \times 2^{(e-b)}$, where m is the mantissa, e is the exponent and b is the bias.


 


Thus, the smallest normalized number for single precision has a mantissa of 0 (all 0s after the binary point) and an exponent of 1:

$1.0 \times 2^{1-127} = 2^{-126}$


 


Now, the largest normalized number for single precision has a mantissa of all 1s (23 ones after the binary point) and an exponent of all 1s except the least significant bit (254). Writing 1.1…1 as a 24-bit integer of all ones divided by $2^{23}$, this number equals:

$\frac{2^{24}-1}{2^{23}} \times 2^{254-127} = (2-2^{-23}) \times 2^{127}$
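If you want to sanity-check the normalized range, the two endpoints can be rebuilt from the formulas and compared with what the C++ standard library reports. This is only a sketch; the function names are invented here, and it assumes `float` is IEEE-754 single precision.

```cpp
#include <cmath>
#include <limits>

// Smallest positive normalized float: 1.0 x 2^(1-127) = 2^-126
float smallest_normal()
{
    return std::ldexp(1.0f, -126);
}

// Largest normalized float: (2 - 2^-23) x 2^127
float largest_normal()
{
    float significand = 2.0f - std::ldexp(1.0f, -23);   // 1.11...1 (24 ones), exact
    return std::ldexp(significand, 127);
}

// Compare the derived endpoints with the library's limits.
bool normalized_range_matches_limits()
{
    return smallest_normal() == std::numeric_limits<float>::min()
        && largest_normal() == std::numeric_limits<float>::max();
}
```

Both comparisons are exact because every intermediate value is representable as a float and `std::ldexp` only changes the exponent.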


 


Again, as shown in the table above, the range of positive denormalized numbers for single precision is $2^{-149}$ to $(1-2^{-23}) \times 2^{-126}$. Note that the bias (b) here is 127 and the denormalized form is represented as $0.m \times 2^{(-b+1)}$.


 


Thus, the smallest denormalized number for single precision has all mantissa bits 0 except the least significant bit, and the exponent field is 0 in any case:

$0.00…1 \times 2^{-127+1} = 2^{-23} \times 2^{-126} = 2^{-149}$


 


And the largest denormalized number for single precision has all mantissa bits 1, with the exponent field again 0. Writing 0.11…1 as a 23-bit integer of all ones divided by $2^{23}$, this number equals:

$\frac{2^{23}-1}{2^{23}} \times 2^{-127+1} = (1-2^{-23}) \times 2^{-126}$
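The denormalized endpoints can be checked the same way, this time against the actual bit patterns (mantissa 00..01 and 11..11 with a zero exponent). A minimal sketch, with hypothetical helper names, assuming a 32-bit IEEE-754 `float` and that denormals are not flushed to zero:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float.
float float_from_bits(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Smallest denormal: 2^-23 x 2^-126 = 2^-149
float smallest_denormal()
{
    return std::ldexp(1.0f, -149);
}

// Largest denormal: (1 - 2^-23) x 2^-126
float largest_denormal()
{
    return std::ldexp(1.0f - std::ldexp(1.0f, -23), -126);
}

// Exponent field 0 with mantissa 00..01 and 11..11 respectively.
bool denormal_range_matches_bit_patterns()
{
    return float_from_bits(0x00000001u) == smallest_denormal()
        && float_from_bits(0x007FFFFFu) == largest_denormal();
}
```

The next bit pattern up, `0x00800000`, is the smallest normalized number, so the denormals fill the gap between zero and 2^-126 with evenly spaced values.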


 


Similarly, you can derive the ranges for double precision floats. The following table shows the ranges for single and double precision floats, for both positive and negative values.


 

















 


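| | Single Precision | Double Precision |
|---|---|---|
| Normalized form | ± $2^{-126}$ to $(2-2^{-23}) \times 2^{127}$ | ± $2^{-1022}$ to $(2-2^{-52}) \times 2^{1023}$ |
| Denormalized form | ± $2^{-149}$ to $(1-2^{-23}) \times 2^{-126}$ | ± $2^{-1074}$ to $(1-2^{-52}) \times 2^{-1022}$ |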
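The double precision column can be checked with the same approach, using 52 mantissa bits and a bias of 1023. Again this is just a sketch with made-up function names, assuming `double` is an IEEE-754 64-bit value.

```cpp
#include <cmath>
#include <limits>

// Normalized endpoints: 2^-1022 and (2 - 2^-52) x 2^1023
double smallest_normal_double()  { return std::ldexp(1.0, -1022); }
double largest_normal_double()   { return std::ldexp(2.0 - std::ldexp(1.0, -52), 1023); }

// Denormalized endpoints: 2^-1074 and (1 - 2^-52) x 2^-1022
double smallest_denormal_double() { return std::ldexp(1.0, -1074); }
double largest_denormal_double()  { return std::ldexp(1.0 - std::ldexp(1.0, -52), -1022); }

// Compare all four endpoints with the library's view of double.
bool double_ranges_match_limits()
{
    using lim = std::numeric_limits<double>;
    return smallest_normal_double()   == lim::min()
        && largest_normal_double()    == lim::max()
        && smallest_denormal_double() == lim::denorm_min()
        && largest_denormal_double()  == std::nextafter(lim::min(), 0.0);
}
```

The last comparison uses the fact that the largest denormal is the value immediately below the smallest normalized number.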


 


 


 


Algorithms:


 


Let’s look at an algorithm (written in C++) that takes a 32-bit integer (containing the raw bit representation of a single precision float) and returns the equivalent float value.


 


     #include <cmath>    // std::pow, INFINITY, NAN
     #include <cstdint>  // std::uint32_t

     float single_float_from_storage_bits(std::uint32_t storagebits)
     {
         // Check the sign bit: +1 for positive, -1 for negative
         int sign = ((storagebits & 0x80000000u) == 0) ? 1 : -1;
         // Get the exponent value, bit positions 30 - 23
         int exponent = static_cast<int>((storagebits & 0x7f800000u) >> 23);
         // Get the mantissa value, bit positions 22 - 00
         int mantissa = static_cast<int>(storagebits & 0x007fffffu);
         // If the exponent is 0, this is either 0 or a denormalized form
         if (exponent == 0)
         {
             // Since the mantissa is also 0, this is definitely a 0
             if (mantissa == 0)
             {
                 // The sign decides between +0 and -0
                 return (sign * 0.0f);
             }
             else
             {
                 // Denormalized value: 0.m x 2^-126 = m x 2^-149
                 return (float)(sign * mantissa * std::pow(2.0, -149));
             }
         }
         // If the exponent is all 1s, this is either Infinity or NaN
         else if (exponent == 0xff)
         {
             // If the mantissa is 0, this is +infinity or -infinity
             if (mantissa == 0)
             {
                 // Use the sign to decide between +infinity and -infinity
                 return ((sign < 0) ? -INFINITY : +INFINITY);
             }
             // Otherwise it is a NaN; you can also check for SNaN or QNaN here
             else return NAN;
         }
         // Now we are sure this is a normalized number
         else
         {
             // Add the implied 24th bit of the mantissa
             mantissa |= 0x00800000;
             // The mantissa is now the 24-bit integer 1m, i.e. 1.m x 2^23,
             // so the exponent is e - 127 - 23 = e - 150
             return (float)(sign * mantissa * std::pow(2.0, exponent - 150));
         }
     }
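A note on the 150 in the normalized case: the bias is still 127, but once the implied bit is OR-ed in, the mantissa is treated as a 24-bit integer, which is 1.m scaled up by 2^23. The extra 23 is folded into the exponent, giving e - 127 - 23 = e - 150. The sketch below compares the two forms; both function names are hypothetical and `std::ldexp` is used in place of `pow` for exact power-of-two scaling.

```cpp
#include <cmath>
#include <cstdint>

// Textbook form: 1.m x 2^(e - 127), with 1.m built as a real number in [1, 2).
float normalized_textbook(int exponent, std::uint32_t fraction)
{
    float significand = 1.0f + std::ldexp(static_cast<float>(fraction), -23);
    return std::ldexp(significand, exponent - 127);
}

// Integer form: (1m as a 24-bit integer) x 2^(e - 150).
float normalized_integer_form(int exponent, std::uint32_t fraction)
{
    std::uint32_t significand = fraction | 0x00800000u;   // add the implied bit
    return std::ldexp(static_cast<float>(significand), exponent - 150);
}
```

For any exponent field from 1 to 254 and any 23-bit fraction the two functions return the same float; for instance, exponent 128 with fraction `0x400000` gives 3.0 either way.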


 


In a similar manner, you can go the other way: given a single precision float, you can work out the layout of its storage bits.
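One possible sketch of the reverse direction is below. Rather than recomputing the fields arithmetically, it copies the float's bytes into an integer with `std::memcpy` and then masks out the three fields, which is a different technique from the decoding function above. The names `storage_bits_from_single_float`, `FloatFields` and `split_fields` are invented for this example, and the code assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cstdint>
#include <cstring>

// Copy the raw bytes of a float into a 32-bit integer.
std::uint32_t storage_bits_from_single_float(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// The three fields of a single precision float.
struct FloatFields
{
    std::uint32_t sign;      // bit 31
    std::uint32_t exponent;  // bits 30 - 23 (biased)
    std::uint32_t mantissa;  // bits 22 - 0
};

// Split a float into its sign, exponent and mantissa fields.
FloatFields split_fields(float value)
{
    const std::uint32_t bits = storage_bits_from_single_float(value);
    return FloatFields{ bits >> 31, (bits >> 23) & 0xFFu, bits & 0x007FFFFFu };
}
```

For example, 1.0f has storage bits `0x3F800000` (sign 0, exponent 127, mantissa 0) and -2.0f has `0xC0000000` (sign 1, exponent 128, mantissa 0).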


Comments (7)

  1. I found these blog entries from Prem to be quite exhaustive. So posting the links here just in case you…

  2. Ken Furphy says:

    Maybe someone could give me (a C novice, with little exposure to IEEE Floating-Point Numbers) some feedback on a solution I have used to the converting a 32 bit integer to a floating point value.  Please note that my application is based on the use of Z-World’s Dynamic C and I have no idea whether it might be successful on other platforms

    The basis of the conversion is that the unsigned long 32 bit integer (k) formats to the equivalent floating point value if it is output using the %f conversion character.

    unsigned long  k;

    float val;

    char valstr[32];

    void main()

     {

     k =           // IEEE floating point number

     sprintf(valstr,"%f", k);

     val=atof(valstr);           // val returns equivalent float value for k

    }  

  3. Tony Proctor says:

    I have a very extensive experience of cross-platform IEEE floating-point, and I can say as a result that it just doesn’t work in practice.

    The IEEE standard allows for a range of NaN values to support different application-defined states but it doesn’t state what bit pattern the "canonical NaN" (as in the result of a ‘Flt Invalid’ exception) uses. Each vendor has therefore adopted different patterns. Some even vary the canonical pattern between chipsets, and again in their software maths library. Also, the standard doesn’t say whether quiet or signalling NaNs have a set sign bit, and so vendors differ again. As a result of all this, it’s almost impossible to define your private NaNs such that they avoid the canonical pattern on all platforms (affects data transportability) and across chipsets from the same vendor.

    On top of this, there is a great overlap between the floating-point abstraction provided by the IEEE standard, and that by the O/S itself, and then again by the maths library associated with different programming languages. It’s a very big mess overall.

    On top of this, many of the nuances of the IEEE standard, such as NaNs/Infinities, are actually implemented in software rather than hardware. The H/W simply raises microtraps and expects some very low-level software to fix up the IEEE semantics. For an application that wants to rely massively on private NaN semantics (e.g. OLAP databases), the result can be a 50-100 times performance hit. I know that several companies, as a result, have just ignored the IEEE semantics and used special "high values" to represent application-defined states.

  5. Deb says:

    Thanks a lot….Very very helpful to me..

  6. Geof Sawaya says:

    Why do you use 150 as your bias and not 127? S/b [1 << [exp bit count – 1]] – 1.  
