Floating-point Representation

In computing, a number with a decimal point is called a floating point number. For example, the number 1 is an integer, but 1.0 is a floating point number.

Scientific Notation

In considering such numbers, some are very large, while others are tiny:

Large numbers
- The distance from the earth to the sum is approximately 92,960,000 miles or about 490,828,800,000 Feet.
- In chemistry, a mole is the number of atoms in 12 grams of carbon 12 — experimentally determined to be 6.02214179 × 10²³ atoms.
- Intel's 8-core i7 processor contains about 2,600,000,000 Transistors.
Tiny numbers
- With current technology, a transistor in a chip may be as small as 7 nanometers or 0.000000007 meters.
- The thickness of a sheet of paper may be approximately 0.1 millimeters or 0.001 meters.
- The wave length of ultraviolet light is less than 400 nanometers or 0.0000004 meters.
- The diameter of a hydrogen atom is about 0.1 nanometers or 0.0000000001 meters.

The need to handle such a remarkable range of numbers naturally motivates the use of scientific notation: move the decimal point after the first digit and record any shift of the decimal point by indicating a power of ten. The following examples show the above numbers in scientific notation.

Number	Decimal Point Shift	Number in Scientific Notation
92,960,000	shift decimal point left 7 places	9.296 × 10⁷
490,828,800,000	shift left 11 places	4.908288 × 10¹¹
6.02214179 × 10²³	already in scientific notation	6.02214179 × 10²³
2,600,000,000	shift decimal point left 9 places	2.6 × 10⁹
0.000000007	shift decimal point right 9 places	7. × 10^-9
0.001	shift decimal point right 3 places	1. × 10^-3
0.0000004	shift decimal point right 7 places	1. × 10^-7
0.0000000001	shift decimal point right 10 places	1. × 10^-10

One important characteristic of this notation is that we can maintain a substantial level of accuracy of each number, regardless of what power of 10 might be involved. For example, 4.908288 × 10¹¹ provides 7 digits of accuracy, whereas the number 4.908288 × 10^-7 still has 7 digits of accuracy.

Jargon: Numbers in scientific notation have the form a × 10ⁿ, where 1 ≤ a < 10 and n is an integer. In this context, a is called the mantissa and n is called the exponent for the identified number. For example, given the number 2,600,000,000, we rewrite it in scientific notation to get 2.6 × 10⁹. With this transformation, 2.6 is the mantissa and 9 is the exponent.

The Floating Point Standard(s) of the Institute of Electrical and Electronics Engineers (IEEE)

Within computers, [almost] all floating point numbers follow standards formulated by the Institute of Electrical and Electronics Engineers (IEEE). These standards utilize a common approach for representing floating point numbers, but variations of the standards allow different numbers of bits.

Within C, the float data type commonly specifies 32-bit storage, often called a single-precision representation.
The double data type commonly specifies 64-bit storage, often called a double-precision representation.

Both float or double storage utilize a binary version of scientific notation. Translation of a decimal number to either float or double follows 4 steps:

Write the number in a binary version of scientific notation, obtaining a binary mantissa and a binary exponent.
Use one bit to represent the sign of the number.
Store the binary mantissa in an efficient way.
Use the binary exponent as a starting point in a computation to obtain a stored exponent.

Differences between float or double relate to how bits are allocated and how certain computations are made.

data type	total bits	sign bit	exponent bits	mantissa bits
`float`	32 bits	1 bit	8 bits	23 bits
`double`	64 bits	1 bit	11 bits	52 bits

Regarding accuracy, 10 bits can represent numbers up to 1023 (about 3 decimal digits of accuracy), so the 23 bits used for float numbers yields about 7 or 8 decimal digits of accuracy. Similar, the 52 bits available for double numbers allows about 16 decimal digits of accuracy.

With this background, we now examine in detail each of the four steps in translating a decimal number to float or double floating point notation.

Writing a Decimal in Binary Scientific Notation

The reading on the Binary Representation of Integers noted that the digits of an integer, written in binary form correspond to powers of two. For integers, the powers were non-negative, but negative powers also are posible. Here are some examples, following a somewhat expanded format from that used in the previous reading.

binary number	0	0	0	1	1	0	1	0	1	0	0	1
powers of 2	2⁷	2⁶	2⁵	2⁴	2³	2²	2¹	2⁰	2^-1	2^-2	2^-3	2^-4
decimal value of power	128	64	32	16	8	4	2	1	1/2=0.5	1/4 = 0.25	1/8 = 0.125	1/16 = 0.0625
column values with binary 1				16	8		2		0.5			0.0625

Putting these pieces together, the binary number 00011010.1001 represents the number 16+8+2+0.5+0.0625 = 26.5625. In interpreting the number 11010.1001, the digits to the left of the point (11010) correspond to positive powers of 2, and the digits to the right (1001) correspond to negative powers. Also, since we are writing binary numbers, the period itself should not be called a decimal point; more properly it is called a binary point or a radix point.

As this example illustrates, parts of a binary number to the left of a binary point are integers, and we can use the discussion of [non-negative] integers to convert from decimal to binary. For parts of a binary number to the right of a binary point, some numbers are easy: 0.5, 0.25, 0.125, etc. correspond to negative powers of two. Combinations of these negative powers (e.g., 0.75 = 0.5 + 0.25) also are easy to represent (e.g., binary 0.11). However, other fractional decimals may require more careful work to find reasonably binary representations. In the interest of time and effort, in what follows we assume

our starting number is an integer,
the number has a decimal representation ending in .00, .25, .50, or .75, or
we already know the binary representation of the number we wish to work with.

Given a binary number, such as 00011010.1001, we can write the number in normalized form (scientific notation) by shifting the radix to come after the initial 1. For 11010.1001, the radix point should be sifted 4 places to the left, obtaining 1.10101001 and an exponent 4 (written 100 in binary). In a mix of binary and decimal, binary 11010.1001 = 1.10101001 × 2¹⁰⁰ — a binary expression, except for our use of "2" which must be raised to the 4th power. Expressed in binary, 1.10101001 is the mantissa, and 100 is the exponent for the initial number 00011010.1001.

Some additional examples illustrate this rewriting of a decimal number into a binary normal form (binary scientific notation).

Decimal number	Binary equivalent	Normalized mantissa	Shift	binary exponent
19	10011	1.0011	4 left	100
87	1010111	1.010111	6 left	110
2718	101010011110	1.01010011110	11 left	1011
34145	1000010101100001	1.000010101100001	15 left	1111

Overall, the algorithm for writing a number in normalized binary form involves three steps:

Convert the number to a binary number.
Shift the binary point, so that it appears just after the initial 1 (the number will be 1.-----). The shifted number is called the mantissa.
Record the number of bits required for the shift in step 2 (consider a left shift as a positive, a right shift as a negative). This amount of shift is the exponent.

Practice

Translate the number

to normalized binary form, by giving both the binary mantissa (with no leading 0's) and the binary exponent.

Answer:
    Mantissa:
    Exponent:

Optional: For the Mathematically Inclined and/or Curious

Using decimal notation, some fractions have an infinite decimal representation. For example, 1/3 = 0.33333333... . One way to compute this decimal representation is to use long division:

       0 . 3 3 3 3 3 ...    
      ------------------
   3 | 1 . 0 0 0 0 0 ...
           9
       -----
           1 0
             9
         -----
             1 0
               9
           -----
               1 0
                 9
             -----
                 1 0
                   9
               -----
                   1  
                    etc.

At each stage of the division (after the first step), we divide 3 into 10, obtain a quotient of 3, and remainder of 1.

In binary notation, a similar situation arises with many fractions. For example, consider the decimal number one tenth (1/10 or 0.1 decimal). To determine the relevant binary representation, note that 10 (decimal) corresponds to 1010 (binary). To compute 0.1 (decimal) we again utilize long division — in binary.

           0 . 0 0 0 1 1 0 0 1 1 0 0 1 ...
          --------------------------------
    1010 | 1 . 0 0 0 0 0 0 0 0 0 0 0 0 ...
               1 0 1 0
               -------
                 1 1 0 0
                 1 0 1 0
                 -------    
                     1 0 0 0 0
                       1 0 1 0
                     -------
                       1 1 0 0
                       1 0 1 0
                       -------    
                           1 0 0 0 0
                             1 0 1 0
                           ---------
                               1 1 0
                                   etc.

After starting 0.00011, the 0011 pattern continues forever, so the decimal number 0.1 cannot be represented with a finite number of digits in binary.

Computing the Sign Bit

Using either single precision or double precision, the first bit represents a sign. As with sign-magnitude notation for integers, 0 is used to represent a positive number and 1 is used for a negative. (We'll worry about representing the number zero later in this reading.)

For example, consider the representation of ±87.25. We still have to discuss details of storing the mantissa and the exponent. However, the number will start:

0?????????... for +87.25, and
1?????????... for -87.25

Storing the Binary Mantissa

When writing a number in (decimal-based) scientific notation, the leading digit may be 1, 2, ..., 9. For example, some decimal numbers at the start of this reading included 1. × 10^-10, 2.6 × 10⁹, 4.908288 × 10¹¹, 6.02214179 × 10²³, 7. × 10^-9, and 9.296 × 10⁷.

With binary normalized form (e.g., binary scientific form), the radix point moves after the first non-zero digit — but in binary, the only possible non-zero digit is 1. Thus, a number in binary normalized form must begin 1.0-----.

With this property that all normalized binary numbers begin 1.?????, the IEEE Floating Point Standards observe that there is no need to store the leading bit. The bit will always be 1, so we can save space by storing the mantissa starting with the second bit. For example, for the mantissa 1.10101001, the bits actually stored are 10101001 — the leading 1 is not stored.

These stored bits are place at the left of the mantissa section of the single- or double-precision IEEE floating point number. Since single-precision numbers require 23-bit mantissas and double-precision numbers require 52-bit mantissas. Once the desired mantissa is placed at the left of this field, the rest of the space is filled with 0's. Thus, for the mantissa 1.10101001, the actual mantissa stored in single-precision format is 10101001000000000000000.

In summary, given the binary mantissa of a floating point number:

Remove the initial 1
Place the remaining mantissa into the 23 bits (single precision) or 52 bits (double precision), beginning at the left of the prescribed bit field.
- If the given mantissa contains more bits than will fit in the prescribed space, just take the first 23 or 52 bits.
- If the given mantissa contains fewer bits the required number of bits, place the bits computed at the left of the space, and fill the remaining bits with 0's.

Precision	Exponent range	Bias
Single Precision	-126 to +127	127
Double Precision	-1022 to +1023	1023

Number	Number × 2^-126	Stored Mantissa
0	0.0000... × 2^-126	000000...
2^-127	0.100000... × 2^-126	100000...
1.1011 × 2^-129	0.0011011 × 2^-126	00110110000...

Binary Representation of Floating-point Numbers

Scientific Notation

The Floating Point Standard(s) of the Institute of Electrical and Electronics Engineers (IEEE)

Writing a Decimal in Binary Scientific Notation

Practice

Optional: For the Mathematically Inclined and/or Curious

Computing the Sign Bit

Storing the Binary Mantissa

Storing the Binary Exponent

Normalized Exponent Summary

A Closer Look at Exponents

What About Zero and Tiny Numbers

What About the Stored Exponent 111...111 ?

The Full IEEE Floating Point Number

Practice with Single-precision Floating Point

sign	stored exponent	stored mantissa
1 bit	8 bits	23 bits

	0 bits typed	0 bits typed

created 4 April 2016 by Henry M. Walker expanded and edited 7 April 2016 by Henry M. Walker reformatted for CS 415 24 July 2022 by Henry M. Walker
For more information, please contact Henry M. Walker at walker@cs.grinnell.edu.