CSC 115.005/006 Sonoma State University Spring 2022
Scribbler 2
CSC 115.005/006:
Programming I
Scribbler 2
Instructor: Henry M. Walker

Lecturer, Sonoma State University
Professor Emeritus of Computer Science and Mathematics, Grinnell College



Notes:

Characters in C

At a basic level, human writing is culture-dependent; a group of people agree that specific shapes convey meaning. For example, early tribes in some parts of the world used pictures or pictographs to record events or activities; people in Europe used various alphabets (e.g., a Latin or Greek or Cyrillic alphabet); people in Asia used strokes to represent syllables; etc. Pragmatically, a wide range of approaches is possible, provided people within a region agree on what each symbol represents.

On the other hand, within a digital computer, we have noted that the use of electrical circuits at the hardware level leads to the storage of data (e.g., numbers) as binary sequences (i.e., sequences of 0's and 1's).

Putting these two ideas together, human writing can be captured within computers by encoding each writing element as a specified binary sequence. As a simple example, 01100001 might represent the lowercase letter 'a' in a Latin alphabet, or 00100011 might represent the number sign (#). Ultimately, any encoding is arbitrary; programmers and users just need to agree on what encoding to use for a particular program and application.


Encoding(s) of Characters in C

The 2011 C Draft Standard discusses three types of character data:

To understand the motivation behind several character types and how to use these types within programs, some history is helpful.

After discussion of these alternative encoding systems, this reading provides a sample program that illustrates some uses of the char data type.

The 2011 C Standard

The 2011 C standard, ISO/IEC 9899 - Programming languages - C, was adopted in 2011 by ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) and published 8 December 2011. The online page for this standard notes, "Published ISO and IEC standards can be purchased from a member body of ISO or IEC." The same page notes, "The latest publically available version of the C11 standard is the document WG14 N1570, dated 2011-04-12. This is a WG14 working paper, but it reflects what was to become the standard at the time of issue."

Discussion throughout this course is based on the publicly available version.


Early History

In the first several decades of computing (certainly through the 1970s and well into the 1980s), at least two circumstances impacted the storage and processing of character data:

Two examples illustrate important variations in character encoding during this early period.


In practice, ASCII became prominent for small and medium-sized computers. Since many applications in science and engineering utilized these relatively inexpensive machines, the use of ASCII dominated those fields. Also, with the emergence of mini-computers and personal computers, ASCII became widely adopted within small businesses, academic applications, and home use.

In using a 7-bit character encoding, the 8th bit was called a parity bit: If the number of 1's in the 7-bit encoding was odd, the parity bit was given the value 1; if the number of 1's in the 7-bit encoding was even, the parity bit was 0. Combined, the 7-bit character encoding plus the parity bit would always contain an even number of 1's.

If electrical interference caused an error in the transmission and receipt of 1 bit within an 8-bit grouping, the parity bit would identify an error had occurred. We could not determine which bit was wrong, but we would know that the full 8-bit grouping should be re-transmitted.

If electrical interference changed several bits in an 8-bit sequence, a parity bit might not detect multiple transmission errors. However, if interference impacted several characters, it would be unlikely that the parity bits would check correctly for all transmitted characters. Thus, if parity bits did not match properly, all or part of a message could be re-transmitted.


Basic character sets

Within C, the char data type conceptually accommodates this varied history for character encoding, while allowing extended characters to address local needs. The following specifications come from Section 5.2.1 Character sets of the 2011 draft C Standard, page 22 and following.


Traditional printers utilized a printing element that could move to the right, back one character at a time to the left, and/or to the beginning of a new line. Consistent with these traditional printing movements, characters in C include a horizontal tab \t that moves to the right, but there is no corresponding character for moving backward to a tab spot. Similarly, there is a character to move downward, but there is no capability to move up.


The 2011 draft C Standard updates its definition of the term byte. Within C, a byte is the smallest memory unit that one's local computer can address directly. For many contemporary machines, this size is 8 bits — consistent with the traditional definition.

Throughout this course, we often will use "byte" to mean 8 bits. However, note that the C standard allows a larger size, possibly anticipating future machines.


Character Types in C

With this background, we can review quickly the data types available within C for character data.


Type char: a small int

With C, type char is widely used for characters in the basic character set. Technically, char is considered to be a type of small integer: the integer value represents the encoding of the relevant character, and a char requires at least 8 bits of storage. Further, the underlying code for each prescribed character in the basic character set is required to be non-negative.

Technically, since type char is considered an integer type, C allows char data to be negative or non-negative. This possibility yields several additional details for the C programming language.

With this last observation, most programs for this course use only the basic character set and thus can use type char without complication.

Caution:

Although the 8-bit code extending ASCII is widely used today in C, writing these numbers within code is inherently dangerous! For example, consider the code

char ch;
ch = 65;   /* the ASCII code for 'A' */

Such code is error prone and may not work for at least three reasons:

Programming Hint

To avoid use of magic numbers and to help clarify program meaning, two approaches are recommended.

In practice, many implementations of char in C build upon the traditional ASCII character code. Using an 8-bit size for a byte, the 7-bit ASCII code is modified by adding an initial 0. Additional characters may or may not be included, possibly including 8-bit encodings with an initial 1.

The following table shows the codes and characters for the printable characters within the ASCII standard.

int char   int char   int char   int char   int char
 33  !      34  "      35  #      36  $      37  %
 38  &      39  '      40  (      41  )      42  *
 43  +      44  ,      45  -      46  .      47  /
 48  0      49  1      50  2      51  3      52  4
 53  5      54  6      55  7      56  8      57  9
 58  :      59  ;      60  <      61  =      62  >
 63  ?      64  @      65  A      66  B      67  C
 68  D      69  E      70  F      71  G      72  H
 73  I      74  J      75  K      76  L      77  M
 78  N      79  O      80  P      81  Q      82  R
 83  S      84  T      85  U      86  V      87  W
 88  X      89  Y      90  Z      91  [      92  \
 93  ]      94  ^      95  _      96  `      97  a
 98  b      99  c     100  d     101  e     102  f
103  g     104  h     105  i     106  j     107  k
108  l     109  m     110  n     111  o     112  p
113  q     114  r     115  s     116  t     117  u
118  v     119  w     120  x     121  y     122  z
123  {     124  |     125  }     126  ~

Type wchar_t: for international applications

Since type char utilizes little storage (typically just 8 bits), the range of values available is quite limited. C's basic character set supports a limited Latin alphabet involving about 100 characters. Since 8 bits provide sufficient space for 256 values, various extensions may be used for local applications; for example, characters may be added for a Greek or Cyrillic alphabet. However, requirements for a world-wide audience extend far beyond the 256-character limit.

To address a world-wide audience, type wchar_t and its supporting wchar.h library allocate substantially more space. (Technically, the related types char16_t and char32_t allocate at least 16 bits and at least 32 bits, respectively. Also, as with type char, type wchar_t is considered a type of integer.)

Type wchar_t commonly provides 16 or 32 bits of storage, depending on the implementation, allowing the encoding of numerous alphabets, pictographs, and other symbols. As with other encoding details, the 2011 draft C Standard does not dictate what encoding to use. However, in practice, various implementations often utilize the Unicode Standard, described by the Unicode Standard's home page as, "a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages."

Details of Unicode and wchar_t are beyond the scope of this course. Readers interested in applications for world-wide audiences are encouraged to explore various online descriptions and tutorials for both wchar_t and Unicode.

Multibyte Character Storage

Both char and wchar_t types represent fixed-sized storage for characters, providing two important properties:

However, both types char and wchar_t have disadvantages as well:

To resolve such troubles, the 2011 draft C Standard supports a multibyte capability. In this encoding, character encodings have variable lengths. Some characters may require as little as 1 byte of storage, while others may require up to 4 bytes (or even more). With variable lengths possible, the range of allowed characters can be extensive, sufficient for identified world-wide needs for an application. Also, not all characters require 32 (or more) bits, possibly limiting storage requirements.

Complementing the notion of variable-length encodings, the 2011 draft C Standard also describes the concept of a locale, with which a local computing environment can define specific character sets for applications that run in that environment.

As with the topics of Unicode and wchar_t, details of multibyte character encodings within C are beyond the scope of this course. Interested readers might explore a widely used coding system, called UTF-8, a variable-length encoding system which builds upon Unicode. (UTF-8 is an abbreviation for "Universal Coded Character Set + Transformation Format – 8-bit". Explore online references to unpack the jargon!)


Sample program using type char

Most basic characters in a C program can be referenced by placing the character in single quotes:

char ch1 = 'a';
char ch2 = '?';
char ch3 = '7';

However, three characters have special meanings in C programs:

To avoid confusion with other uses within a C program, when we intend to specify these three characters, we precede each with a backslash, as illustrated below:

char ch4 = '\\';    /* assign the backslash character to variable ch4 */

The following C program character-example.c builds upon these declarations to illustrate how variables of type char might be used in practice.


/* program to illustrate the use of the char data type
 */

#include <stdio.h>
#include <ctype.h>  /* draw upon several useful char functions */

int main ()
{

The ctype library provides several functions that can be helpful in working with char data.


  /* variable declarations */
  char ch1 = 'a';
  char ch2 = '?';
  char ch3 = '7';
  char ch4 = '\\';
  char ch5, ch6, ch7, ch8;

As with numeric variables, character variables may be initialized when declared, or variables may be declared early in a program and assigned values later.


  /* character encoding is required to increase by 1
     for each digit 0, 1, 2, ..., 9
  */
  ch5 = ch3 + 1;   // char is a small int, so addition is possible
  ch6 = ch5 - 4;   // subtraction also possible

Since C requires the coding for digits to increase by one for each digit, the encoding for '7' plus 1 must give the encoding for '8', which is assigned to ch5.


  /* compare two characters by comparing encodings */
  if (ch6 == '4')
    printf ("ch6 is digit '4'\n");
  else
    printf ("ch6 is NOT digit '4'\n");

Following the arithmetic for the values of ch3, ch5, and ch6, the variable ch6 should contain the encoding for character '4'. Since char is a type of int, the equality operation == can be used to compare char values.


  /* utilize ctype library */
  /* determine if ch1 is a digit 0, ..., 9 */
  if (isdigit (ch1))
    printf ("%c is a digit\n", ch1);
  else
    printf ("%c is not a digit\n", ch1);

The ctype library provides several tests to determine the type of character represented in a variable. Each function returns a nonzero value (true) if the character fits within the prescribed category and 0 (false) otherwise.

function   category tested
isalpha    character is an alphabetic letter (either uppercase or lowercase)
isdigit    character is a decimal digit
isalnum    character is either alphabetic or a decimal digit
islower    character is a lowercase letter
isupper    character is an uppercase letter
isxdigit   character is a hexadecimal digit (0-9, a-f, or A-F)
isspace    character is space, \f, \n, \r, \t or \v
isprint    character is a printable character, including a space
ispunct    character is a printable character, except a space, letter, or digit
isgraph    character is a printable character, except a space
iscntrl    character is a control character (e.g., \a, \b, \f)

  /* convert lower case letter to upper case,
     other characters not changed */
  ch7 = toupper (ch1);
  ch8 = toupper (ch2);

The ctype library contains two functions to selectively transform letters. In each case, letters within a category are changed, but all other characters remain the same.

function   description
tolower    return lowercase letters if given uppercase letters; return values for all other characters are the same as the original
toupper    return uppercase letters if given lowercase letters; return values for all other characters are the same as the original

  /* print characters and their codes */
  printf ("characters and their codes\n");
  printf ("character\tcode\n");
  printf ("   %c \t\t %d\n", ch1, ch1);
  printf ("   %c \t\t %d\n", ch2, ch2);
  printf ("   %c \t\t %d\n", ch3, ch3);
  printf ("   %c \t\t %d\n", ch4, ch4);
  printf ("   %c \t\t %d\n", ch5, ch5);
  printf ("   %c \t\t %d\n", ch6, ch6);
  printf ("   %c \t\t %d\n", ch7, ch7);
  printf ("   %c \t\t %d\n", ch8, ch8);

  return 0;
}

Since a char is a type of small int, printing can utilize either of two formats:





created 25 May 2016 by Henry M. Walker
expanded and edited 27 May 2016 by Henry M. Walker
For more information, please contact Henry M. Walker at walker@cs.grinnell.edu.