Binary Representation of Floating-point Numbers

In computing, a number with a decimal point is called a floating point number. For example, the number 1 is an integer, but 1.0 is a floating point number.

Scientific Notation

In considering such numbers, some are very large, while others are tiny:

Large numbers
- The distance from the earth to the sum is approximately 92,960,000 miles or about 490,828,800,000 Feet.
- In chemistry, a mole is the number of atoms in 12 grams of carbon 12 — experimentally determined to be 6.02214179 × 10²³ atoms.
- Intel's 8-core i7 processor contains about 2,600,000,000 Transistors.
Tiny numbers
- With current technology, a transistor in a chip may be as small as 7 nanometers or 0.000000007 meters.
- The thickness of a sheet of paper may be approximately 0.1 millimeters or 0.001 meters.
- The wave length of ultraviolet light is less than 400 nanometers or 0.0000004 meters.
- The diameter of a hydrogen atom is about 0.1 nanometers or 0.0000000001 meters.

The need to handle such a remarkable range of numbers naturally motivates the use of scientific notation: move the decimal point after the first digit and record any shift of the decimal point by indicating a power of ten. The following examples show the above numbers in scientific notation.

Number	Decimal Point Shift	Number in Scientific Notation
92,960,000	shift decimal point left 7 places	9.296 × 10⁷
490,828,800,000	shift left 11 places	4.908288 × 10¹¹
6.02214179 × 10²³	already in scientific notation	6.02214179 × 10²³
2,600,000,000	shift decimal point left 9 places	2.6 × 10⁹
0.000000007	shift decimal point right 9 places	7. × 10^-9
0.001	shift decimal point right 3 places	1. × 10^-3
0.0000004	shift decimal point right 7 places	1. × 10^-7
0.0000000001	shift decimal point right 10 places	1. × 10^-10

One important characteristic of this notation is that we can maintain a substantial level of accuracy of each number, regardless of what power of 10 might be involved. For example, 4.908288 × 10¹¹ provides 7 digits of accuracy, whereas the number 4.908288 × 10^-7 still has 7 digits of accuracy.

Jargon: Numbers in scientific notation have the form a × 10ⁿ, where 1 ≤ a < 10 and n is an integer. In this context, a is called the mantissa and n is called the exponent for the identified number. For example, given the number 2,600,000,000, we rewrite it in scientific notation to get 2.6 × 10⁹. With this transformation, 2.6 is the mantissa and 9 is the exponent.

The Floating Point Standard(s) of the Institute of Electrical and Electronics Engineers (IEEE)

Within computers, [almost] all floating point numbers follow standards formulated by the Institute of Electrical and Electronics Engineers (IEEE). These standards utilize a common approach for representing floating point numbers, but variations of the standards allow different numbers of bits.

Within C, the float data type commonly specifies 32-bit storage, often called a single-precision representation.
The double data type commonly specifies 64-bit storage, often called a double-precision representation.

Both float or double storage utilize a binary version of scientific notation. Translation of a decimal number to either float or double follows 4 steps:

Write the number in a binary version of scientific notation, obtaining a binary mantissa and a binary exponent.
Use one bit to represent the sign of the number.
Store the binary mantissa in an efficient way.
Use the binary exponent as a starting point in a computation to obtain a stored exponent.

Differences between float or double relate to how bits are allocated and how certain computations are made.

data type	total bits	sign bit	exponent bits	mantissa bits
`float`	32 bits	1 bit	8 bits	23 bits
`double`	64 bits	1 bit	11 bits	52 bits

Regarding accuracy, 10 bits can represent numbers up to 1023 (about 3 decimal digits of accuracy), so the 23 bits used for float numbers yields about 7 or 8 decimal digits of accuracy. Similar, the 52 bits available for double numbers allows about 16 decimal digits of accuracy.

With this background, we now examine in detail each of the four steps in translating a decimal number to float or double floating point notation.

Writing a Decimal in Binary Scientific Notation

The reading on the Binary Representation of Integers noted that the digits of an integer, written in binary form correspond to powers of two. For integers, the powers were non-negative, but negative powers also are posible. Here are some examples, following a somewhat expanded format from that used in the previous reading.

binary number	0	0	0	1	1	0	1	0	1	0	0	1
powers of 2	2⁷	2⁶	2⁵	2⁴	2³	2²	2¹	2⁰	2^-1	2^-2	2^-3	2^-4
decimal value of power	128	64	32	16	8	4	2	1	1/2=0.5	1/4 = 0.25	1/8 = 0.125	1/16 = 0.0625
column values with binary 1				16	8		2		0.5			0.0625

Putting these pieces together, the binary number 00011010.1001 represents the number 16+8+2+0.5+0.0625 = 26.5625. In interpreting the number 11010.1001, the digits to the left of the point (11010) correspond to positive powers of 2, and the digits to the right (1001) correspond to negative powers. Also, since we are writing binary numbers, the period itself should not be called a decimal point; more properly it is called a binary point or a radix point.

As this example illustrates, parts of a binary number to the left of a binary point are integers, and we can use the discussion of [non-negative] integers to convert from decimal to binary. For parts of a binary number to the right of a binary point, some numbers are easy: 0.5, 0.25, 0.125, etc. correspond to negative powers of two. Combinations of these negative powers (e.g., 0.75 = 0.5 + 0.25) also are easy to represent (e.g., binary 0.11). However, other fractional decimals may require more careful work to find reasonably binary representations. In the interest of time and effort, in what follows we assume

our starting number is an integer,
the number has a decimal representation ending in .00, .25, .50, or .75, or
we already know the binary representation of the number we wish to work with.

Given a binary number, such as 00011010.1001, we can write the number in normalized form (scientific notation) by shifting the radix to come after the initial 1. For 11010.1001, the radix point should be sifted 4 places to the left, obtaining 1.10101001 and an exponent 4 (written 100 in binary). In a mix of binary and decimal, binary 11010.1001 = 1.10101001 × 2¹⁰⁰ — a binary expression, except for our use of "2" which must be raised to the 4th power. Expressed in binary, 1.10101001 is the mantissa, and 100 is the exponent for the initial number 00011010.1001.

Some additional examples illustrate this rewriting of a decimal number into a binary normal form (binary scientific notation).

Decimal number	Binary equivalent	Normalized mantissa	Shift	binary exponent
19	10011	1.0011	4 left	100
87	1010111	1.010111	6 left	110
2718	101010011110	1.01010011110	11 left	1011
34145	1000010101100001	1.000010101100001	15 left	1111

Overall, the algorithm for writing a number in normalized binary form involves three steps:

Convert the number to a binary number.
Shift the binary point, so that it appears just after the initial 1 (the number will be 1.-----). The shifted number is called the mantissa.
Record the number of bits required for the shift in step 2 (consider a left shift as a positive, a right shift as a negative). This amount of shift is the exponent.

Practice

Translate the number

to normalized binary form, by giving both the binary mantissa (with no leading 0's) and the binary exponent.

Answer:
    Mantissa:
    Exponent:

Optional: For the Mathematically Inclined and/or Curious

Using decimal notation, some fractions have an infinite decimal representation. For example, 1/3 = 0.33333333... . One way to compute this decimal representation is to use long division:

       0 . 3 3 3 3 3 ...    
      ------------------
   3 | 1 . 0 0 0 0 0 ...
           9
       -----
           1 0
             9
         -----
             1 0
               9
           -----
               1 0
                 9
             -----
                 1 0
                   9
               -----
                   1  
                    etc.

At each stage of the division (after the first step), we divide 3 into 10, obtain a quotient of 3, and remainder of 1.

In binary notation, a similar situation arises with many fractions. For example, consider the decimal number one tenth (1/10 or 0.1 decimal). To determine the relevant binary representation, note that 10 (decimal) corresponds to 1010 (binary). To compute 0.1 (decimal) we again utilize long division — in binary.

           0 . 0 0 0 1 1 0 0 1 1 0 0 1 ...
          --------------------------------
    1010 | 1 . 0 0 0 0 0 0 0 0 0 0 0 0 ...
               1 0 1 0
               -------
                 1 1 0 0
                 1 0 1 0
                 -------    
                     1 0 0 0 0
                       1 0 1 0
                     -------
                       1 1 0 0
                       1 0 1 0
                       -------    
                           1 0 0 0 0
                             1 0 1 0
                           ---------
                               1 1 0
                                   etc.

After starting 0.00011, the 0011 pattern continues forever, so the decimal number 0.1 cannot be represented with a finite number of digits in binary.

Computing the Sign Bit

Using either single precision or double precision, the first bit represents a sign. As with sign-magnitude notation for integers, 0 is used to represent a positive number and 1 is used for a negative. (We'll worry about representing the number zero later in this reading.)

For example, consider the representation of ±87.25. We still have to discuss details of storing the mantissa and the exponent. However, the number will start:

0?????????... for +87.25, and
1?????????... for -87.25

Storing the Binary Mantissa

When writing a number in (decimal-based) scientific notation, the leading digit may be 1, 2, ..., 9. For example, some decimal numbers at the start of this reading included 1. × 10^-10, 2.6 × 10⁹, 4.908288 × 10¹¹, 6.02214179 × 10²³, 7. × 10^-9, and 9.296 × 10⁷.

With binary normalized form (e.g., binary scientific form), the radix point moves after the first non-zero digit — but in binary, the only possible non-zero digit is 1. Thus, a number in binary normalized form must begin 1.0-----.

With this property that all normalized binary numbers begin 1.?????, the IEEE Floating Point Standards observe that there is no need to store the leading bit. The bit will always be 1, so we can save space by storing the mantissa starting with the second bit. For example, for the mantissa 1.10101001, the bits actually stored are 10101001 — the leading 1 is not stored.

These stored bits are place at the left of the mantissa section of the single- or double-precision IEEE floating point number. Since single-precision numbers require 23-bit mantissas and double-precision numbers require 52-bit mantissas. Once the desired mantissa is placed at the left of this field, the rest of the space is filled with 0's. Thus, for the mantissa 1.10101001, the actual mantissa stored in single-precision format is 10101001000000000000000.

In summary, given the binary mantissa of a floating point number:

Remove the initial 1
Place the remaining mantissa into the 23 bits (single precision) or 52 bits (double precision), beginning at the left of the prescribed bit field.
- If the given mantissa contains more bits than will fit in the prescribed space, just take the first 23 or 52 bits.
- If the given mantissa contains fewer bits the required number of bits, place the bits computed at the left of the space, and fill the remaining bits with 0's.

Storing the Binary Exponent

Our discussion of the storage of floating point numbers has started with a normalized, binary representation:

mantissa × 2^{binary exponent}

For many floating point numbers, storage of the exponent is [almost] straightforward — but with a twist. For floating point numbers with an exponent in a specified range, a bias value is added.

Precision	Exponent range	Bias
Single Precision	-126 to +127	127
Double Precision	-1022 to +1023	1023

For example, consider the 11010.1001. We have noted that the mantissa in binary normal form is 1.10101001. With a shift of the binary point 4 digits to the left, the exponent is 4 (100 in binary).

To represent the exponent 4 in a single-precision representation, add the bias 127 to obtain 4+127=131 or 10000011 (binary), and the stored exponent is 10000011.
To represent the exponent 4 in a double-precision representation, add the bias 1023 to obtain 4+1023=1027 or 10000000011 (binary), and the stored exponent is 10000000011.

With the constraint on the range allowed for exponents in normalized binary form, and with the addition of the bias, note that all stored exponents will be greater than 1, but less than 111...111 . Thus, the exponents stored for normalized binary numbers will not be all 0's and not all 1's. For most numbers we might use, this storage of floating point numbers (eith single- or double-precision) will work without difficulty!

Normalized Exponent Summary

For normalized binary numbers with exponents in the specified range (e.g., -126 to +127 for float type, storage of the exponent follows two simple steps:

Add the bias (127 for float, 1023 for double)
Store the result

A Closer Look at Exponents

This process ensures that the stored exponent for normalized binary numbers with exponents in range will be neither all 0's nor all 1's.

What About Zero and Tiny Numbers

The storage of normalized binary numbers works fine in many cases, but it fails for the number zero and for numbers with exponents smaller than 2^-126 (single-precision) or smaller than 2^-1023 (double-precision).

For zero, no shifting of a binary point will yield a position after a 1.
For tiny numbers, an exponent below the range (e.g., below 2^-126 (single-precision)) will be too small to yield a positive number when the 127 bias is added.

Such numbers are stored un-normalized and considered to be multiplied by 2^-126 (single-precision) or 2^-1023 (double-precision). That is, the number is written in the form

mantissa × 2^-126

and the mantissa is stored directly. Some examples for single-precision follow:

Number	Number × 2^-126	Stored Mantissa
0	0.0000... × 2^-126	000000...
2^-127	0.100000... × 2^-126	100000...
1.1011 × 2^-129	0.0011011 × 2^-126	00110110000...

What About the Stored Exponent 111...111 ?

So far, all stored exponents for floating point numbers have been less than all 1's (111...111). Within the IEEE Standards, an exponent of all 1's is reserved for various error conditions — not for actual numbers.

The pseudo-number "positive infinity" — a number too large to represent in the space available — is represented by
- sign bit 0,
- exponent 111...111, and
- mantissa all 0's
For single-precision float values, this is 01111111100000000000000000000000
The pseudo-number "negative infinity" — a number too large a negative to represent in the space available — is represented by
- sign bit 1,
- exponent 111...111, and
- mantissa all 0's
For single-precision float values, this is 11111111100000000000000000000000
Error conditions, sometimes designated NaN for ``Not a Number'', are designated with exponent 111...111 and a non-zero mantissa. As an example, such a circumstance might arise if the computer tried to divide by zero.

The Full IEEE Floating Point Number

Combining the full discussion of representing floating point numbers within single-precision (32-bit) or double-precision (64-bit), the following rules apply for most circumstances:

If the number is zero, the full representation is all 0's.
If the number is non-zero:
- Write the number in a normalized, binary format: mantissa × 2^{binary exponent}. (In what follows, Assume the binary exponent is within a specified range.)
- Assuming the binary exponent is within a specified range, the first bit of the stored floating-point number represents the sign: 0 for positive, 1 for negative.
- Store the exponent in 8 bits (single precision) or 11 bits (double precision), after adding the required bias (127 or 1023, respectively).
- Store the mantissa in 23 or 52 bits, respectively, omitting the leading 1, placing the resulting mantissa on the left, and filling with 0's on the right as needed.

Practice with Single-precision Floating Point