Binary Representation of Floating-point Numbers
In computing, a number with a decimal point is called a floating point number. For example, the number 1 is an integer, but 1.0 is a floating point number.
Scientific Notation
In considering such numbers, some are very large, while others are tiny:
-
Large numbers
- The distance from the earth to the sum is approximately 92,960,000 miles or about 490,828,800,000 Feet.
- In chemistry, a mole is the number of atoms in 12 grams of carbon 12 — experimentally determined to be 6.02214179 × 1023 atoms.
- Intel's 8-core i7 processor contains about 2,600,000,000 Transistors.
-
Tiny numbers
- With current technology, a transistor in a chip may be as small as 7 nanometers or 0.000000007 meters.
- The thickness of a sheet of paper may be approximately 0.1 millimeters or 0.001 meters.
- The wave length of ultraviolet light is less than 400 nanometers or 0.0000004 meters.
- The diameter of a hydrogen atom is about 0.1 nanometers or 0.0000000001 meters.
The need to handle such a remarkable range of numbers naturally motivates the use of scientific notation: move the decimal point after the first digit and record any shift of the decimal point by indicating a power of ten. The following examples show the above numbers in scientific notation.
Number | Decimal Point Shift | Number in Scientific Notation |
---|---|---|
92,960,000 | shift decimal point left 7 places | 9.296 × 107 |
490,828,800,000 | shift left 11 places | 4.908288 × 1011 |
6.02214179 × 1023 | already in scientific notation | 6.02214179 × 1023 |
2,600,000,000 | shift decimal point left 9 places | 2.6 × 109 |
0.000000007 | shift decimal point right 9 places | 7. × 10-9 |
0.001 | shift decimal point right 3 places | 1. × 10-3 |
0.0000004 | shift decimal point right 7 places | 1. × 10-7 |
0.0000000001 | shift decimal point right 10 places | 1. × 10-10 |
One important characteristic of this notation is that we can maintain a substantial level of accuracy of each number, regardless of what power of 10 might be involved. For example, 4.908288 × 1011 provides 7 digits of accuracy, whereas the number 4.908288 × 10-7 still has 7 digits of accuracy.
Jargon: Numbers in scientific notation have the form a × 10n, where 1 ≤ a < 10 and n is an integer. In this context, a is called the mantissa and n is called the exponent for the identified number. For example, given the number 2,600,000,000, we rewrite it in scientific notation to get 2.6 × 109. With this transformation, 2.6 is the mantissa and 9 is the exponent.
The Floating Point Standard(s) of the Institute of Electrical and Electronics Engineers (IEEE)
Within computers, [almost] all floating point numbers follow standards formulated by the Institute of Electrical and Electronics Engineers (IEEE). These standards utilize a common approach for representing floating point numbers, but variations of the standards allow different numbers of bits.
- Within C, the float data type commonly specifies 32-bit storage, often called a single-precision representation.
- The double data type commonly specifies 64-bit storage, often called a double-precision representation.
Both float or double storage utilize a binary version of scientific notation. Translation of a decimal number to either float or double follows 4 steps:
- Write the number in a binary version of scientific notation, obtaining a binary mantissa and a binary exponent.
- Use one bit to represent the sign of the number.
- Store the binary mantissa in an efficient way.
- Use the binary exponent as a starting point in a computation to obtain a stored exponent.
Differences between float or double relate to how bits are allocated and how certain computations are made.
data type | total bits | sign bit | exponent bits | mantissa bits |
---|---|---|---|---|
float | 32 bits | 1 bit | 8 bits | 23 bits |
double | 64 bits | 1 bit | 11 bits | 52 bits |
Regarding accuracy, 10 bits can represent numbers up to 1023 (about 3 decimal digits of accuracy), so the 23 bits used for float numbers yields about 7 or 8 decimal digits of accuracy. Similar, the 52 bits available for double numbers allows about 16 decimal digits of accuracy.
With this background, we now examine in detail each of the four steps in translating a decimal number to float or double floating point notation.
Writing a Decimal in Binary Scientific Notation
The reading on the Binary Representation of Integers noted that the digits of an integer, written in binary form correspond to powers of two. For integers, the powers were non-negative, but negative powers also are posible. Here are some examples, following a somewhat expanded format from that used in the previous reading.
binary number | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
powers of 2 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 2-1 | 2-2 | 2-3 | 2-4 |
decimal value of power | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 | 1/2=0.5 | 1/4 = 0.25 | 1/8 = 0.125 | 1/16 = 0.0625 |
column values with binary 1 | 16 | 8 | 2 | 0.5 | 0.0625 |
Putting these pieces together, the binary number 00011010.1001 represents the number 16+8+2+0.5+0.0625 = 26.5625. In interpreting the number 11010.1001, the digits to the left of the point (11010) correspond to positive powers of 2, and the digits to the right (1001) correspond to negative powers. Also, since we are writing binary numbers, the period itself should not be called a decimal point; more properly it is called a binary point or a radix point.
As this example illustrates, parts of a binary number to the left of a binary point are integers, and we can use the discussion of [non-negative] integers to convert from decimal to binary. For parts of a binary number to the right of a binary point, some numbers are easy: 0.5, 0.25, 0.125, etc. correspond to negative powers of two. Combinations of these negative powers (e.g., 0.75 = 0.5 + 0.25) also are easy to represent (e.g., binary 0.11). However, other fractional decimals may require more careful work to find reasonably binary representations. In the interest of time and effort, in what follows we assume
- our starting number is an integer,
- the number has a decimal representation ending in .00, .25, .50, or .75, or
- we already know the binary representation of the number we wish to work with.
Given a binary number, such as 00011010.1001, we can write the number in normalized form (scientific notation) by shifting the radix to come after the initial 1. For 11010.1001, the radix point should be sifted 4 places to the left, obtaining 1.10101001 and an exponent 4 (written 100 in binary). In a mix of binary and decimal, binary 11010.1001 = 1.10101001 × 2100 — a binary expression, except for our use of "2" which must be raised to the 4th power. Expressed in binary, 1.10101001 is the mantissa, and 100 is the exponent for the initial number 00011010.1001.
Some additional examples illustrate this rewriting of a decimal number into a binary normal form (binary scientific notation).
Decimal number | Binary equivalent | Normalized mantissa | Shift | binary exponent |
19 | 10011 | 1.0011 | 4 left | 100 |
87 | 1010111 | 1.010111 | 6 left | 110 |
2718 | 101010011110 | 1.01010011110 | 11 left | 1011 |
34145 | 1000010101100001 | 1.000010101100001 | 15 left | 1111 |
Overall, the algorithm for writing a number in normalized binary form involves three steps:
- Convert the number to a binary number.
- Shift the binary point, so that it appears just after the initial 1 (the number will be 1.-----). The shifted number is called the mantissa.
- Record the number of bits required for the shift in step 2 (consider a left shift as a positive, a right shift as a negative). This amount of shift is the exponent.
Practice
Translate the number
to normalized binary form, by giving both the binary mantissa (with no leading 0's) and the binary exponent.
Answer:
Mantissa:
Exponent:
Optional: For the Mathematically Inclined and/or Curious
Using decimal notation, some fractions have an infinite decimal representation. For example, 1/3 = 0.33333333... . One way to compute this decimal representation is to use long division:
0 . 3 3 3 3 3 ... ------------------ 3 | 1 . 0 0 0 0 0 ... 9 ----- 1 0 9 ----- 1 0 9 ----- 1 0 9 ----- 1 0 9 ----- 1 etc.
At each stage of the division (after the first step), we divide 3 into 10, obtain a quotient of 3, and remainder of 1.
In binary notation, a similar situation arises with many fractions. For example, consider the decimal number one tenth (1/10 or 0.1 decimal). To determine the relevant binary representation, note that 10 (decimal) corresponds to 1010 (binary). To compute 0.1 (decimal) we again utilize long division — in binary.
0 . 0 0 0 1 1 0 0 1 1 0 0 1 ... -------------------------------- 1010 | 1 . 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 ------- 1 1 0 0 1 0 1 0 ------- 1 0 0 0 0 1 0 1 0 ------- 1 1 0 0 1 0 1 0 ------- 1 0 0 0 0 1 0 1 0 --------- 1 1 0 etc.
After starting 0.00011, the 0011 pattern continues forever, so the decimal number 0.1 cannot be represented with a finite number of digits in binary.
Computing the Sign Bit
Using either single precision or double precision, the first bit represents a sign. As with sign-magnitude notation for integers, 0 is used to represent a positive number and 1 is used for a negative. (We'll worry about representing the number zero later in this reading.)
For example, consider the representation of ±87.25. We still have to discuss details of storing the mantissa and the exponent. However, the number will start:
- 0?????????... for +87.25, and
- 1?????????... for -87.25
Storing the Binary Mantissa
When writing a number in (decimal-based) scientific notation, the leading digit may be 1, 2, ..., 9. For example, some decimal numbers at the start of this reading included 1. × 10-10, 2.6 × 109, 4.908288 × 1011, 6.02214179 × 1023, 7. × 10-9, and 9.296 × 107.
With binary normalized form (e.g., binary scientific form), the radix point moves after the first non-zero digit — but in binary, the only possible non-zero digit is 1. Thus, a number in binary normalized form must begin 1.0-----.
With this property that all normalized binary numbers begin 1.?????, the IEEE Floating Point Standards observe that there is no need to store the leading bit. The bit will always be 1, so we can save space by storing the mantissa starting with the second bit. For example, for the mantissa 1.10101001, the bits actually stored are 10101001 — the leading 1 is not stored.
These stored bits are place at the left of the mantissa section of the single- or double-precision IEEE floating point number. Since single-precision numbers require 23-bit mantissas and double-precision numbers require 52-bit mantissas. Once the desired mantissa is placed at the left of this field, the rest of the space is filled with 0's. Thus, for the mantissa 1.10101001, the actual mantissa stored in single-precision format is 10101001000000000000000.
In summary, given the binary mantissa of a floating point number:
- Remove the initial 1
-
Place the remaining mantissa into the 23 bits (single precision) or 52 bits
(double precision), beginning at the left of the prescribed bit field.
- If the given mantissa contains more bits than will fit in the prescribed space, just take the first 23 or 52 bits.
- If the given mantissa contains fewer bits the required number of bits, place the bits computed at the left of the space, and fill the remaining bits with 0's.
A Closer Look at Exponents
This process ensures that the stored exponent for normalized binary numbers with exponents in range will be neither all 0's nor all 1's.
What About Zero and Tiny Numbers
The storage of normalized binary numbers works fine in many cases, but it fails for the number zero and for numbers with exponents smaller than 2-126 (single-precision) or smaller than 2-1023 (double-precision).
- For zero, no shifting of a binary point will yield a position after a 1.
- For tiny numbers, an exponent below the range (e.g., below 2-126 (single-precision)) will be too small to yield a positive number when the 127 bias is added.
Such numbers are stored un-normalized and considered to be multiplied by 2-126 (single-precision) or 2-1023 (double-precision). That is, the number is written in the form
and the mantissa is stored directly. Some examples for single-precision follow:
Number | Number × 2-126 | Stored Mantissa |
---|---|---|
0 | 0.0000... × 2-126 | 000000... |
2-127 | 0.100000... × 2-126 | 100000... |
1.1011 × 2-129
0.0011011 × 2-126
| 00110110000...
| |
What About the Stored Exponent 111...111 ?
So far, all stored exponents for floating point numbers have been less than all 1's (111...111). Within the IEEE Standards, an exponent of all 1's is reserved for various error conditions — not for actual numbers.
- The pseudo-number "positive infinity" — a number too large to
represent in the space available — is represented by
- sign bit 0,
- exponent 111...111, and
- mantissa all 0's
- The pseudo-number "negative infinity" — a number too large a
negative to represent in the space available — is represented by
- sign bit 1,
- exponent 111...111, and
- mantissa all 0's
- Error conditions, sometimes designated NaN for ``Not a Number'', are designated with exponent 111...111 and a non-zero mantissa. As an example, such a circumstance might arise if the computer tried to divide by zero.
The Full IEEE Floating Point Number
Combining the full discussion of representing floating point numbers within single-precision (32-bit) or double-precision (64-bit), the following rules apply for most circumstances:
- If the number is zero, the full representation is all 0's.
-
If the number is non-zero:
- Write the number in a normalized, binary format: mantissa × 2binary exponent. (In what follows, Assume the binary exponent is within a specified range.)
- Assuming the binary exponent is within a specified range, the first bit of the stored floating-point number represents the sign: 0 for positive, 1 for negative.
- Store the exponent in 8 bits (single precision) or 11 bits (double precision), after adding the required bias (127 or 1023, respectively).
- Store the mantissa in 23 or 52 bits, respectively, omitting the leading 1, placing the resulting mantissa on the left, and filling with 0's on the right as needed.
Practice with Single-precision Floating Point
Translate the number
to 32-bit IEEE floating-point format.
Answer: Fill in the blanks
sign | stored exponent | stored mantissa |
1 bit | 8 bits | 23 bits |
0 bits typed | 0 bits typed |
created 4 April 2016 by Henry M. Walker expanded and edited 7 April 2016 by Henry M. Walker |
![]() ![]() |
For more information, please contact Henry M. Walker at walker@cs.grinnell.edu. |