Lab on Data Representation Consequences

This assignment explores some practical consequences of the representation of data in program processing.

Integer Overflow

Recall from your prior work in C/C++ that the constants INT_MIN and INT_MAX (defined in limits.h for C and climits for C++) contain the smallest and largest int values available with the current hardware and compiler.

Finding Integer Averages: Given two integers, i and j, an expression is desired to compute their average (as an integer).

Notes, since this problem involves integers:
- In case the arithmetic average is a real number ending in .5, the average may be rounded either up or down. Thus, either 7 or 8 should be considered as a correct [integer] average of 6 and 9.
- In the case that the integers, i and j, are the same, their average (of course) should be i (or j). Thus, the average of 5 and 5 should be 5, and the average of 6 and 6 should be 6.
- A program should avoid issues of overflow to the extent possible. (Pragmatically, arithmetic with INT_MIN and INT_MAX may be troublesome, but such difficult cases should be kept to a minimum.)

Consider Tables 1 and 2 below, in which the first column identifies five expressions that might be used to compute the average of i and j.
1. Complete Table 1 below, assuming i and j are non-negative integers. That is,
  - For each empty cell under the top heading "overflow ???", indicate
    - "yes" if the statement is true (or OK) for the cases specified in the column header, or
    - "no" otherwise.
  - For each empty cell under the top heading "ignoring overflow, expression gives correct answer", indicate
    - "always", if the expression consistently computes the correct average within the conditions given in the column header.
    - "sometimes", if the expression gives the correct average for some i, j values, but not others, given the conditions indicated in the column header.
    - "never", if the expression never computes a correct answer, given the conditions indicated in the column header.
2. Based on your work in Step 1, what, if any, expression(s) would you recommend for a program that needs to compute the average of two non-negative integers? Explain briefly.
3. Complete Table 2 below, assuming i and j can be any integers (positive, negative, or zero). In completing Table 2:
  - Use the same options, "yes", "no", "always", "sometimes", or "never", as in Step 1.
  - Note that if an entry in Table 1 is "no", "sometimes" or "never", then the corresponding entry in Table 2 may be similar.
4. Based on your work in Step 3, what, if any, expression(s) would you recommend for a program that needs to compute the average of two arbitrary integers? Explain briefly.

Consider the program integer-average.c.
1. Compile and run the program, and record what int values are possible within C programs.
2. Review the program to determine how the values of arr1 are computed, and how the value of sum compares to INT_MAX.
3. Check the program output. Is the computation of the average of values for arr1 correct?
4. Answer Steps 2 and 3 for array arr2. What is different in the processing? To the extent that you can, explain why the average computation for this array yields an incorrect result.

Storage of Real Numbers and its Accuracy

The international standard for 64-bit floating-point numbers (IEEE 754, often the basis for a double in C/C++) uses a binary version of scientific notation, with a sign, an exponent (a power of 2 in binary) and a mantissa (also as a binary number). With the international standard for 64-bit floating-point numbers, the bits are allocated as follows:

1 bit: sign (plus or minus)
11 bits: an exponent
52 bits: the mantissa

See Binary Representation of Floating-point Numbers for details.

In practice, this international binary standard does not store the leading mantissa bit, because in scientific notation for binary numbers the leading bit is always 1 (the number 0 is treated in a different way). Thus, since the 64-bit standard explicitly stores 52 bits for the mantissa, this format actually can provide 53 bits of accuracy for stored numbers.

In binary, the decimal number 1023 can be represented with 10 bits. Thus, the decimal number 1,000 can be stored in about 10 binary bits, and 3-digit decimal numbers require about 10 binary bits. Using this perspective, the decimal number 1,000,000 can be stored with about 20 bits, and 6-digit decimal numbers require about 20 bits. Continuing this insight, 15-digit decimal numbers require about 50 bits.

Also, 3 binary bits can distinguish 8 different values, so 3 additional binary bits allow storage for roughly one more decimal digit.

Putting these observations together, we might expect that the 53 bits utilized in the 64-bit international standard can store about 16 decimal digits of accuracy.

To gain first-hand experience with the storage of double numbers in C/C++, Problem 3 considers the storage of the following numbers.

  "0.1234567890123456789012345678901234567890" ;   // digits for easy counting
  "0.2424242424242424242424242424242424242424" ;   // all digits < 5
  "0.6868686868686868686868686868686868686868" ;   // all digits > 5

To investigate the storage of double numbers, the program double-storage.c prints the double to several decimal places of accuracy.

Download and compile this program.
1. Run the program, based on the number
  0.1234567890123456789012345678901234567890
  where the digit pattern can help count individual decimal places.
  - How does the program store and print the number exactly to 40 decimal places of accuracy? That is, how are the 40 digits stored and printed exactly?
  - The output of the program is organized into groups of lines. Describe what is printed on each line of a group. Also, indicate how each output line is obtained. (You may need to consult a C/C++ manual to understand some functions, such as sprintf.)
  - printf tries to round a double to the number of decimal places specified. For the output involving 13 to 16 decimal places, does the output reflect this rounding?
  - What can you say about rounding (or the lack thereof), when 17 or more decimal places are printed?
  - As more digits are printed (beyond 17 decimal places), what can you say about the accuracy of the double number printed? Why do you think this accuracy is (or is not) observed?
2. Repeat Step 1, after modifying the program to process the number
  0.2424242424242424242424242424242424242424
  where all digits are < 5, so no rounding would be appropriate.
3. Repeat Step 1, after modifying the program to process the number
  0.6868686868686868686868686868686868686868
  where all digits are > 5, so rounding up would always be appropriate.

Associativity of Addition for Real Numbers

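Over the years, many approaches have been developed to compute the value of the number π. Many of these approaches are based on an infinite series, one of which is

$$\pi = 2\sum_{i=0}^{\infty} \frac{i!}{(2i+1)!!} = 2\left(1 + \frac{1}{3} + \frac{1\cdot 2}{3\cdot 5} + \frac{1\cdot 2\cdot 3}{3\cdot 5\cdot 7} + \cdots\right)$$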

Details behind this formula may be found in a Wikipedia article on Leibniz formula for π and a stackexchange.com article on Series that converge to π quickly.

Although this is an infinite sum, calculus (and algebra) indicates that successively better approximations to π may be obtained by including more and more terms of the series. Also, it is worth noting that computationally each term is smaller than the previous.

Program pi-approx.c

- asks the user how many terms n in the series to compute,
- computes the desired number of terms,
- prints the first two and last two terms calculated,
- computes and prints the sum, starting from term i=0 to i=n-1 (that is, starting with the largest term and adding successive smaller terms), and also
- computes and prints the sum, starting from term i=n-1 to i=0 (that is, starting with the smallest term and adding successive larger terms).

Read, analyze, download, compile, and run program pi-approx.c.
1. In reading the program, how are successive terms in the series computed? Explain briefly why this approach gives the desired sum of terms.
2. In past years, some students have indicated confusion regarding which of the terms, T[0], T[1], ..., T[n-1] and T[n], are small and which are large. Of course, the array indices 0, 1, ... , n-1, n are progressively larger, but what about the array elements themselves?
  - In the program, the computation of the terms involves the statement
```
T[i] = 2.0 * i * T[i-1] / (2.0 * (2.0*i+1.0));
```
    Based on this computation, explain algebraically why each computed term is progressively smaller than the previous.
  - Based on the printout of the first and last terms, confirm (in a written statement) that T[0] > T[1] > T[n-1] > T[n], so adding from T[0] up to T[n] adds numerical values from largest to smallest, and adding from T[n] down to T[0] adds numerical values from smallest to largest.
  (Note: Although this may or may not seem clear from the program or algebra, it is vital for the rest of this problem to understand that when the indices of the array elements get larger the values being added get smaller—be sure to ask about this if you have any questions!)
3. Describe the output generated with the number of terms being 10, 25, 40, 50, 60, 100, and 1000.
4. To what extent does including more terms in the sum help the accuracy when computing from the largest term to the smallest? Explain.
5. To what extent does including more terms in the sum help the accuracy when computing from the smallest term to the largest? Explain.
6. If there is a difference when computing from largest term to smallest versus smallest term to largest, explain the difference. What, if any, conclusions are suggested by the outputs observed from this exercise?

Compounding of Numeric Error

Our discussions of the representation of real numbers (doubles and floats) have identified at least three factors that can cause errors in processing—particularly if the errors can accumulate as processing continues.

- numerical errors can accumulate in some situations during processing (particularly in loops that are repeated many times),
- loops may not continue through the proper number of iterations,
- the order of addition can make a difference:
  - if small numbers are added to large ones, the small numbers may be lost
  - if small numbers are added together first and then to large ones, then the small values may have an impact in the overall sum

Be sure to take these potential troubles into account in answering Steps 5 and 6.

Given that start < end, suppose a loop is to begin at start and finish at (or near) end in exactly n+1 iterations. Within this loop, suppose the control variable will increase by a computed value increment = (end-start)/n with each iteration.

Two loop structures are proposed:
```
// approach 1
increment = (end - start) / n;
for (i = 0; i <= n; i++) {
    value = start + i * increment;
    /* processing for value */
}
```
```
// approach 2
value = start;
increment = (end - start) / n;
while (value <= end) {
    /* processing for value */
    value += increment;
}
```
Although the first approach requires somewhat more arithmetic within the loop than the second, it likely will provide better accuracy. Identify two distinct reasons why the first approach should be preferred over the second.

Suppose y = f(x) is a function that decreases significantly from x=a to x=b, on the interval [a, b], with a < b. Throughout this interval, assume f(x)>0, and assume the Trapezoidal Rule were to be used to approximate the area under y = f(x) on the interval [a, b].
1. Assuming accuracy is the highest priority for this computation, should the main loop begin at a and go toward b or begin at b and go toward a, or is either order fine? Explain.
2. Again, assuming accuracy of the answer is the highest priority, write reasonably efficient code that implements the Trapezoidal Rule for this function on this interval. (To be reasonably efficient, f(x) should be computed only once for each value of x, and division by 2 should be done as little as possible, as discussed in class.)
  Be sure to include your code within a program, and run several tests of the program.
  
  For this step, submit both the program and the output from several test runs.
  
  (Of course, your program must conform to the course's C/C++ Style Guide.)
3. Explain how and why your approach to this problem (with f(x) decreasing significantly from x=a to x=b) should be different from the code when f(x) increases over this interval.
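
For reference, one standard statement of the Trapezoidal Rule is given below. The symbols n (the number of subintervals) and h (the width of each subinterval) are assumptions introduced here, since the text above does not name them. In this form, each value f(x) appears once and the division by 2 appears once.

$$\int_a^b f(x)\,dx \;\approx\; h\left[\frac{f(a) + f(b)}{2} + \sum_{k=1}^{n-1} f(a + kh)\right], \qquad h = \frac{b-a}{n}$$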

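Table 1: `i`, `j` can be any non-negative integers. The first three answer columns fall under the heading "overflow ???"; the last four fall under the heading "ignoring overflow, expression gives correct answer".

| expression | NO OVERFLOW for all but a few `i`, `j` | NO OVERFLOW unless possibly when `i=INT_MAX` | NO OVERFLOW unless possibly when `j=INT_MAX` | `i` even, `j` even | `i` odd, `j` even | `i` even, `j` odd | `i` odd, `j` odd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `avg1 = (i + j) / 2;` | | | | | | | |
| `avg2 = i/2 + j/2;` | | | | | | | |
| `avg3 = (i+1)/2 + j/2;` | | | | | | | |
| `avg4 = (i+1)/2 + (j+1)/2;` | | | | | | | |
| `avg5 = i + (j-i)/2;` | | | | | | | |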

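Table 2: `i`, `j` can be any integers (positive, negative, or zero). The first three answer columns fall under the heading "overflow ???"; the last four fall under the heading "ignoring overflow, expression gives correct answer".

| expression | NO OVERFLOW for all but a few `i`, `j` | NO OVERFLOW unless possibly when `i=INT_MAX` or `i=-INT_MAX` | NO OVERFLOW unless possibly when `j=INT_MAX` or `j=-INT_MAX` | `i` even, `j` even | `i` odd, `j` even | `i` even, `j` odd | `i` odd, `j` odd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `avg1 = (i + j) / 2;` | | | | | | | |
| `avg2 = i/2 + j/2;` | | | | | | | |
| `avg3 = (i+1)/2 + j/2;` | | | | | | | |
| `avg4 = (i+1)/2 + (j+1)/2;` | | | | | | | |
| `avg5 = i + (j-i)/2;` | | | | | | | |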

created 31 March 2022; expanded 24 July 2022; expanded 3 January 2023; modest editing Summer 2023; revised 20 November 2024
For more information, please contact Henry M. Walker at walker@cs.grinnell.edu.

Sonoma State University

Algorithm Analysis
Instructor: Henry M. Walker, Lecturer, Sonoma State University; Professor Emeritus of Computer Science and Mathematics, Grinnell College
