IEEE Standard 754 for Floating Point Numbers

This article describes how single-precision (32-bit) and double-precision (64-bit) floating-point numbers are represented as per IEEE Standard 754.

Floating-Point Representation

Floating-point representation represents real numbers in scientific notation. Scientific notation represents numbers as a base number and an exponent. For example, in decimal, 123.456 could be represented as 1.23456 × 10^2.

In binary, the number 1100.111 might be represented as 1.100111 × 2^3. Here, the value part, i.e. 1.100111, is referred to as the “mantissa”, and the power, i.e. 3, is called the “exponent”. The 2 is the base of the exponent.
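As a quick sanity check, here is a minimal Python sketch (my own illustration, not part of the original article) confirming that the two forms above denote the same value:

    # 1100.111 in binary equals 12.875 in decimal.
    plain = int("1100111", 2) / 2**3                # move the binary point 3 places to the left
    # 1.100111 x 2^3: the mantissa 1.100111 equals 1100111 / 2^6.
    normalized = (int("1100111", 2) / 2**6) * 2**3
    print(plain, normalized)                        # both print 12.875
    assert plain == normalized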

From the storage layout point of view, floating point numbers have three components: the sign, the exponent, and the mantissa.

IEEE floating point numbers come in two sizes, 32-bit single precision and 64-bit double precision numbers. The layouts for the parts of a floating point number are:

Single-Precision

                    Sign    Exponent    Fraction
    Bit positions   31      30-23       22-00
    Number of bits  1       8           23
    Bias                    127

Double-Precision

                    Sign    Exponent    Fraction
    Bit positions   63      62-52       51-00
    Number of bits  1       11          52
    Bias                    1023
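Building on the layout above, here is a short Python sketch (my own illustration, not taken from the standard) that extracts the three fields of a single-precision number using the bit positions from the table:

    import struct

    def float32_fields(x):
        # Re-interpret x as its raw 32-bit single-precision pattern.
        bits, = struct.unpack(">I", struct.pack(">f", x))
        sign     = bits >> 31              # bit 31
        exponent = (bits >> 23) & 0xFF     # bits 30-23 (the stored, biased exponent)
        fraction = bits & 0x7FFFFF         # bits 22-00
        return sign, exponent, fraction

    print(float32_fields(-1.0))   # (1, 127, 0): negative sign, stored exponent 127, zero fraction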

The Sign

A zero in the sign bit indicates that the number is positive; a one indicates a negative number.

The Exponent

 

The base of the exponent, 2, is implicit and is not stored. To keep things simple, the exponent is not stored as a signed number. Instead, a bias is added to the actual exponent to obtain the stored exponent. A single-precision number uses eight bits for the exponent, so it can cover actual exponents ranging from -127 through +128, but the value actually stored is the actual exponent plus the bias. The bias for single-precision numbers is 127, and the bias for double-precision numbers is 1023. This means that the value stored will range from zero to 255 for a single and from zero to 2047 for a double.

    For example, in the case of a single-precision float, a stored exponent of 150 means that the actual exponent is 150 - 127 = 23.

    It should also be noted that exponents with all bits 0 or all bits 1 are reserved for special numbers.
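As a quick check of that example (a Python sketch of my own, not from the article), a bit pattern whose stored exponent field is 150 and whose sign and fraction are zero should decode to 2^23:

    import struct

    stored_exponent = 150                  # biased value from the example above
    bits = stored_exponent << 23           # sign bit = 0, fraction bits = 0
    value, = struct.unpack(">f", struct.pack(">I", bits))

    print(value)                           # 8388608.0, i.e. 2**23
    assert value == 2 ** (stored_exponent - 127)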

 

The Mantissa

 

The mantissa represents the precision bits of the number. It is made up of an implicit leading bit (1) and the fraction bits.

    To find out the value of the implicit leading bit, consider that a binary number can be expressed in scientific notation in different ways, e.g. 1.0011 × 2^3 or 100.11 × 2^1. The mantissa is normalized so that the most significant digit is just to the left of the binary point. Since in binary the only possible non-zero digit is 1, this leading 1 can be taken for granted and does not need to be stored explicitly. As a result, the mantissa of a single-precision float, for example, effectively has 24 bits of resolution, by way of 23 stored fraction bits.
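That effective 24-bit resolution can be seen with a small Python sketch (my own illustration, relying on a round trip through single precision): a step of 2^-24 just above 1.0 is rounded away, while a step of 2^-23 survives.

    import struct

    def round_trip32(x):
        # Round x to the nearest single-precision value and bring it back as a Python float.
        return struct.unpack(">f", struct.pack(">f", x))[0]

    print(round_trip32(1 + 2**-24))   # 1.0 -- below the 24-bit resolution, rounded away
    print(round_trip32(1 + 2**-23))   # 1.0000001192092896 -- the smallest single-precision step above 1.0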

Thus,

  • The sign bit is 0 for positive, 1 for negative.

  • The exponent's base is two.

  • The exponent field contains the bias plus the true exponent.

  • The mantissa always looks like 1.f, where f is the field of fraction bits.
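Putting these four points together, here is a minimal Python sketch of my own (it handles only normalized single-precision values, not zeros, denormalized numbers, infinities, or NaNs) that rebuilds the value as (-1)^sign × 1.f × 2^(stored exponent - 127):

    import struct

    def decode_float32(bits):
        # Split the 32-bit pattern into its three fields.
        sign     = bits >> 31
        exponent = (bits >> 23) & 0xFF
        fraction = bits & 0x7FFFFF
        mantissa = 1 + fraction / 2**23                    # the implicit leading 1, then .f
        return (-1) ** sign * mantissa * 2 ** (exponent - 127)

    bits, = struct.unpack(">I", struct.pack(">f", -6.5))   # raw pattern of an example value
    print(decode_float32(bits))                            # -6.5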

Now, let’s see what goes into 32 bits of a single precision float when we assign 2865412.25 to it:

First, get the sign:

Since the sign of the number is positive, a 0 goes into the top bit.

Second, convert the number to binary:

2865412.25 is 1010111011100100000100.01 in binary (2865412 converted to binary is 1010111011100100000100, and .25 decimal is .01 in binary, i.e. 0 × 2^-1 + 1 × 2^-2).

Third, normalize the number:

We can normalize the binary number to 1.01011101110010000010001 × 2^21.

Fourth, store the exponent:

Adding in the bias, the exponent will be stored as 21 + 127 = 148, or 10010100 written in binary.

Fifth, store the mantissa:

The mantissa is handled by dropping the most significant bit (the implicit leading 1), leaving us with .01011101110010000010001. Taking the 23 bits allowed for the mantissa, we have:

     0101 1101 1100 1000 0010 001

     or 2EE411 in hexadecimal.

 

Thus, the result is stored in 32 bits as:

    Sign    Exponent    Mantissa
    0       10010100    0101 1101 1100 1000 0010 001

i.e. 4A2EE411 in hex.
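The whole example can be verified with a few lines of Python (my own sketch, not part of the article): converting 2865412.25 to its single-precision bit pattern reproduces the binary digits, the stored exponent, and the final hex value shown above.

    import struct

    value = 2865412.25

    print(bin(2865412))            # 0b1010111011100100000100  (step 2: integer part in binary)
    print(f"{21 + 127:08b}")       # 10010100                  (step 4: stored exponent 148)

    packed = struct.pack(">f", value)
    print(packed.hex().upper())    # 4A2EE411                  (the full 32-bit result)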

 

You can also visit the following link, which provides a visual tool to convert IEEE-754 hexadecimal representations to decimal floating-point numbers and vice versa, and gives awesome details of all the components:

https://babbage.cs.qc.edu/courses/cs341/IEEE-754.html

 

Check out my next blog, which will contain details regarding the range of values for IEEE-754 floating-point numbers, denormalized forms, NaN, and some simple algorithms for conversion between IEEE-754 hexadecimal representations and decimal floating-point numbers.