Characters in C

At a basic level, human writing is culture-dependent: a group of people agrees that specific shapes convey meaning. For example, early tribes in some parts of the world used pictures or pictographs to record events or activities; people in Europe used various alphabets (e.g., a Latin, Greek, or Cyrillic alphabet); people in Asia used strokes to represent syllables; etc. Pragmatically, a wide range of approaches is possible, provided people within a region agree on what each symbol represents.

On the other hand, within a digital computer, we have noted that the use of electrical circuits at the hardware level leads to the storage of data (e.g., numbers) as binary sequences (i.e., sequences of 0's and 1's).

Putting these two ideas together, human writing can be captured within computers by encoding each writing element as a specified binary sequence. As a simple example, 01100001 might represent the lowercase letter 'a' in a Latin alphabet, or 00100011 might represent the number sign (#). Ultimately, any encoding is arbitrary: programmers and users just need to agree what encoding to use for a particular program and application.
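To see the binary sequence that a particular machine uses for a character, one can extract the bits of its encoding one at a time. The following is a minimal sketch; the function name char_to_bits is hypothetical (it is not part of any library), and the patterns it produces depend on the local encoding.

```c
#include <limits.h>   /* CHAR_BIT: number of bits in a byte */

/* Write the bits of ch, most significant bit first, into out,
   followed by a terminating null character.
   The array out must have room for CHAR_BIT + 1 characters. */
void char_to_bits (unsigned char ch, char out[])
{
  int i;
  for (i = 0; i < CHAR_BIT; i++)
    {
      /* examine bits from left (high order) to right (low order) */
      int bit = (ch >> (CHAR_BIT - 1 - i)) & 1;
      out[i] = bit ? '1' : '0';
    }
  out[CHAR_BIT] = '\0';
}
```

On a machine that uses ASCII with 8-bit bytes, calling char_to_bits ('#', buffer) would be expected to place 00100011 in buffer; on a machine with a different encoding, the result could differ.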

Encoding(s) of Characters in C

The 2011 C Draft Standard discusses three types of character data:

char, a type of small integer, involving a small, specified number of bits (e.g., 8 bits)
wchar_t, a "wide character type", involving a relatively large, specified number of bits (e.g., 32 bits)
multi-byte characters, in which storage for characters may vary in size (e.g., between 1 and 4 bytes).

To understand the motivation behind several character types and how to utilize these types within programs, some history is helpful.

After discussion of these alternative encoding systems, this reading provides a sample program that illustrates some uses of the char data type.

The 2011 C Standard

The 2011 C standard, ISO/IEC 9899 - Programming languages - C, was adopted in 2011 by ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) and published 8 December 2011. The online page for this standard notes, "Published ISO and IEC standards can be purchased from a member body of ISO or IEC." The same page notes, "The latest publically available version of the C11 standard is the document WG14 N1570, dated 2011-04-12. This is a WG14 working paper, but it reflects what was to become the standard at the time of issue."

Discussion throughout this course is based on the publicly available version.

Early History

In the first several decades of computing (certainly through the 1970s and well into the 1980s), at least three circumstances impacted the storage and processing of character data:

Computer memory was reasonably small and expensive, so there was strong incentive to allocate as few bits as possible for individual characters.
Much computing was being done in the United States or in the United Kingdom — both English-speaking areas, so early computing typically focused on a Western, Latin alphabet and symbols.
Computing hardware and software varied greatly from manufacturer to manufacturer, and several different encodings of character data were in use.

Two examples illustrate important variations in character encoding during this early period.

Extended Binary Coded Decimal Interchange Code (EBCDIC) was developed in 1963-1964 for large IBM mainframe computers, often used within large companies for commercial data processing. For many years, IBM's System/360 computers dominated the commercial market, so this encoding system was widely used for many large-scale applications.

Further, in early days of computing, much data were stored on punch cards, and EBCDIC evolved as an extension of the codes used with those cards.

Technically, EBCDIC represented an 8-bit character encoding. Also, within this code, it is interesting to note that lowercase letters had a smaller encoding number than uppercase letters.

The American Standard Code for Information Interchange (ASCII) developed from telegraphic codes. In addition to 95 printable characters, ASCII included 33 additional characters to help control teletype machines. For example, some control characters included "start of message" (SOM), "end of message" (EOM), and "end of transmission" (EOT).

Use of ASCII within a telegraphic application often would involve an operator typing on one machine, characters transmitted over a distance, and characters printed at the distant machine. Within this transmission process, electrostatic noise could cause interference, and a mechanism was desired to help check if a message received was likely to be the message sent.

Technically, ASCII represented a 7-bit character encoding. To test for errors in transmission, an extra bit was added to each character, so the number of 1's in an 8-bit sequence was even. A receiving machine could then receive an 8-bit sequence, use the first 7 bits as a character, and use the 8th bit to check that an even number of 1's were present.

In practice, ASCII became prominent for small and medium-sized computers. Since many applications in science and engineering utilized these relatively inexpensive machines, the use of ASCII dominated those fields. Also, with the emergence of mini-computers and personal computers, ASCII became widely adopted within small businesses, academic applications, and home use.

In using a 7-bit character encoding, the 8th bit was called a parity bit: If the number of 1's in the 7-bit encoding was odd, the parity bit was given the value 1; if the number of 1's in the 7-bit encoding was even, the parity bit was 0. Combined, the 7-bit character encoding plus the parity bit would always contain an even number of 1's.

If electrical interference caused an error in the transmission and receipt of 1 bit within an 8-bit grouping, the parity bit would identify an error had occurred. We could not determine which bit was wrong, but we would know that the full 8-bit grouping should be re-transmitted.

If electrical interference changed several bits in an 8-bit sequence, a parity bit might not detect multiple transmission errors. However, if interference impacted several characters, it would be unlikely that the parity bits would check correctly for all transmitted characters. Thus, if parity bits did not match properly, all or part of a message could be re-transmitted.
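The even-parity scheme described above can be sketched in C. The function names below are hypothetical, and the sketch assumes the parity bit is stored in the highest-order position of the 8-bit group; the actual bit placement on historical equipment may have differed.

```c
/* Count the 1 bits among the low 8 bits of value. */
int count_ones (unsigned int value)
{
  int count = 0;
  int i;
  for (i = 0; i < 8; i++)
    count += (value >> i) & 1;
  return count;
}

/* Given a 7-bit code, return an 8-bit value whose highest bit is a
   parity bit chosen so that the total number of 1's is even. */
unsigned int add_even_parity (unsigned int code7)
{
  code7 &= 0x7F;                     /* keep only the 7 data bits */
  if (count_ones (code7) % 2 == 1)   /* odd number of 1's so far */
    return code7 | 0x80;             /* parity bit is 1 */
  return code7;                      /* parity bit is 0 */
}

/* Return 1 if an 8-bit group has an even number of 1's, 0 otherwise. */
int parity_ok (unsigned int code8)
{
  return count_ones (code8 & 0xFF) % 2 == 0;
}
```

With this sketch, changing any single bit of a value returned by add_even_parity causes parity_ok to return 0, while changing two bits leaves the count of 1's even, so the error goes undetected.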

Basic character sets

Within C, the char data type conceptually accommodates this varied history for character encoding, while allowing extended characters to address local needs. The following specifications come from Section 5.2.1 Character sets of the 2011 draft C Standard, page 22 and following.

C allows two sets of characters (with two potentially-different encodings): a source character set for the C program source code, and an execution character set for data when the program runs. In practice, the same characters and encodings usually are used both for a C program and for data, even though the two could be different in principle.

Every character set (either for the C program or for data) must include the following basic character set, based on a Latin alphabet:

Category	Characters
uppercase letters	`A B C D E F G H I J K L M N O P Q R S T U V W X Y Z`
lowercase letters	`a b c d e f g h i j k l m n o p q r s t u v w x y z`
decimal digits	`0 1 2 3 4 5 6 7 8 9`
parentheses / braces	( ) [ ] { }
arithmetic symbols	+ - * / = < >
punctuation	`! " # % & ' , . : ; ? _ \ ^ | ~`
space character	[obtained by pressing the space bar once on a keyboard]
horizontal tab, denoted \t	described in Section 5.2.2 as "Moves the active position to the next horizontal tabulation position on the current line. If the active position is at or past the last defined horizontal tabulation position, the behavior of the display device is unspecified."
vertical tab, denoted \v	described in Section 5.2.2 as "Moves the active position to the initial position of the next vertical tabulation position. If the active position is at or past the last defined vertical tabulation position, the behavior of the display device is unspecified."
form feed, denoted \f	move to the top of the next page
null character	all bits in an encoding set to 0

A few additional characters are required for the execution character set (for data).

alert, denoted \a	described in Section 5.2.2 as "Produces an audible or visible alert without changing the active position."
backspace, denoted \b	described in Section 5.2.2 as "Moves the active position to the previous position on the current line. If the active position is at the initial position of a line, the behavior of the display device is unspecified."
new line, denoted \n	described in Section 5.2.2 as "Moves the active position to the initial position of the next line."
carriage return, denoted \r	described in Section 5.2.2 as "Moves the active position to the initial position of the current line."

Traditional printers utilized a printing element that could move to the right, move one character at a time to the left, and/or move to the beginning of a new line. Consistent with these traditional printing movements, characters in C include a horizontal tab \t that moves to the right, but there is no corresponding character for moving backward to a tab spot. Similarly, there is a character to move downward, but there is no capability to move up.

Regarding character encodings, the 2011 draft C Standard provides limited guidance:

- Altogether, the source character set is required to contain 96 characters. Four more characters (yielding 100 total characters) are required for the execution character set.
- Encodings of these characters must fit within a single byte (8 bits on most machines).
- Additional characters and encodings may be defined, but these are locally defined and may vary from one machine or environment to another.
- Specific encodings for prescribed characters are NOT mandated!
- Encoding values for the digits 0, 1, ..., 9 must increase by one through the sequence. That is, the encoding for the digit 1 must be one more than the encoding number for the digit 0, the encoding for the digit 2 must be one more than the encoding number for the digit 1, etc.
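Because digit encodings must be consecutive, converting between a digit character and its numeric value requires only subtraction or addition of the encoding for '0', whatever that encoding happens to be. The following is a minimal sketch; the function names digit_value and digit_char are hypothetical.

```c
#include <assert.h>   /* assert, used to check arguments */

/* Return the numeric value 0..9 of a digit character '0'..'9'.
   This relies only on the guarantee that digit encodings are consecutive. */
int digit_value (char ch)
{
  assert (ch >= '0' && ch <= '9');   /* caller must supply a digit */
  return ch - '0';
}

/* Return the digit character corresponding to a value 0..9. */
char digit_char (int value)
{
  assert (value >= 0 && value <= 9);   /* caller must supply 0..9 */
  return (char) ('0' + value);
}
```

For example, digit_value ('7') yields 7 and digit_char (8) yields '8' on any conforming machine. No similar guarantee is made for letters, so the same trick with 'a' is not portable.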

The 2011 draft C Standard updates its definition of the term byte. Within C, a byte is the smallest memory unit that one's local computer can address directly. For many contemporary machines, this size is 8 bits — consistent with the traditional definition.

Through this course, we often will use "byte" to mean 8 bits. However, note that the C standard allows a larger size, possibly anticipating future machines.
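A program can check the size of a byte on the local machine through the macro CHAR_BIT, defined in the standard header limits.h; the standard requires this value to be at least 8. The helper function below is a hypothetical illustration, not a library function.

```c
#include <limits.h>   /* CHAR_BIT: number of bits in a byte */
#include <stddef.h>   /* size_t */

/* Return the number of bits occupied by n bytes on this machine. */
size_t bits_in_bytes (size_t n)
{
  return n * CHAR_BIT;
}
```

Since sizeof (char) is always 1, the expression bits_in_bytes (sizeof (char)) gives the number of bits in a char, which is 8 on most contemporary machines.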

Character Types in C

With this background, we can review quickly the data types available within C for character data.

Type `char`: a small `int`

In C, type char is widely used for characters in the basic character set. Technically, char is considered to be a type of small integer, the integer value represents the encoding of the relevant character, and a char requires at least 8 bits of storage. Further, the underlying code for each prescribed character in the basic character set is required to be non-negative.

Technically, since type char is considered an integer type, C allows char data to be negative and non-negative. This possibility yields several additional details for the C programming language.

C also defines types signed char and unsigned char, in addition to type char.
C does not indicate whether type char itself is signed or unsigned, and different compilers make different choices. Thus, if it matters in your program, you should explicitly declare variables as signed char or unsigned char.
Since all characters in the basic character set are required to be non-negative, programs using just these characters can use any of the types char, signed char or unsigned char.

With this last observation, most programs for this course use only the basic character set and thus can use type char without complication.
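When the distinction does matter, the header limits.h reveals which choice a compiler has made: CHAR_MIN is negative when plain char is signed and 0 when it is unsigned. The sketch below uses hypothetical function names to illustrate this check and a conversion that gives a non-negative code either way.

```c
#include <limits.h>   /* CHAR_MIN */

/* Return 1 if plain char is a signed type for this compiler, 0 otherwise. */
int plain_char_is_signed (void)
{
  return CHAR_MIN < 0;
}

/* Return the code of ch as a non-negative integer, whether or not
   plain char is signed on this machine. */
int char_code (char ch)
{
  return (unsigned char) ch;
}
```

For characters in the basic character set, char_code (ch) and ch have the same value; the two can differ only for extended characters on machines where plain char is signed.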

Caution:

Although the 8-bit code extending ASCII is widely used today in C, writing these numbers within code is inherently dangerous! For example, consider the code

```
char ch;
ch = 65;   /* the ASCII code for 'A' */
```

Such code is error prone and may not work for at least three reasons:

Although the code 65 may represent 'A' on many machines, the C standard allows other encodings as well, and code 65 may mean something else when the program is compiled and run on another machine.
Typing the number 65 is error prone: a typographical error is easy to make but can be hard to find.
Arbitrary numbers (e.g., 65) in code are sometimes called magic numbers. Since they have no intrinsic meaning, their use does not clarify the logic of a program.

Programming Hint

To avoid use of magic numbers and to help clarify program meaning, two approaches are recommended.

For printable characters, use the literal character (in single quotes). Thus, one might use 'A' rather than code 65.
For other numbers, define a variable or symbol in place of magic numbers. (Such numbers can be defined once at the beginning of a program, where checking is relatively easy, and then used in their logical form throughout the program.) For example, the number Pi might be defined as
```
#define Pi 3.1415926535
```
or
```
const double Pi = 3.1415926535;
```

In practice, many implementations of char in C build upon the traditional ASCII character code. Using an 8-bit size for a byte, the 7-bit ASCII code is modified by adding an initial 0. Additional characters may or may not be included, possibly including 8-bit encodings with an initial 1.

The following table shows the codes and characters for the printable characters within the ASCII standard.

int char	int char	int char	int char	int char
33 !	34 "	35 #	36 $	37 %
38 &	39 '	40 (	41 )	42 *
43 +	44 ,	45 -	46 .	47 /
48 0	49 1	50 2	51 3	52 4
53 5	54 6	55 7	56 8	57 9
58 :	59 ;	60 <	61 =	62 >
63 ?	64 @	65 A	66 B	67 C
68 D	69 E	70 F	71 G	72 H
73 I	74 J	75 K	76 L	77 M
78 N	79 O	80 P	81 Q	82 R
83 S	84 T	85 U	86 V	87 W
88 X	89 Y	90 Z	91 [	92 \
93 ]	94 ^	95 _	96 `	97 a
98 b	99 c	100 d	101 e	102 f
103 g	104 h	105 i	106 j	107 k
108 l	109 m	110 n	111 o	112 p
113 q	114 r	115 s	116 t	117 u
118 v	119 w	120 x	121 y	122 z
123 {	124 |	125 }	126 ~
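Rather than relying on a table of numbers, a program can ask the ctype library which codes are printable on the local machine. The following minimal sketch (the function name is hypothetical) counts the codes from 0 through 127 that isprint accepts.

```c
#include <ctype.h>   /* isprint */

/* Count the codes from 0 through 127 that isprint classifies
   as printable on this machine. */
int count_printable_codes (void)
{
  int code;
  int count = 0;
  for (code = 0; code <= 127; code++)
    if (isprint (code))   /* nonzero result means printable */
      count++;
  return count;
}
```

On a machine that uses ASCII with the default "C" locale, the result would be expected to be 95: the 94 characters in the table above plus the space character (code 32). This figure is an assumption about the local encoding, not a guarantee of the C standard.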

Type `wchar_t`: for international applications

Since type char utilizes little storage (typically just 8 bits), the range of values available is quite limited. C's basic character set supports a limited Latin alphabet involving about 100 characters. Since 8 bits provide sufficient space for 256 values, various extensions may be used for local applications; for example, characters may be added for a Greek or Cyrillic alphabet. However, requirements for a world-wide audience extend far beyond the 256 character limit.

To address a world-wide audience, type wchar_t and its supporting wchar.h library allocate substantially more space. (The 2011 standard also provides types char16_t and char32_t, declared in uchar.h, which allocate at least 16 and at least 32 bits, respectively. Also, as with type char, type wchar_t is considered a type of integer.)

Type wchar_t typically provides 16 or 32 bits of storage, depending on the platform, allowing the encoding of numerous alphabets, pictographs, and other symbols. As with other encoding details, the 2011 draft C Standard does not dictate what encoding to use. However, in practice, various implementations often utilize the Unicode Standard, described by the Unicode Standard's home page as, "a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages."

Details of Unicode and wchar_t are beyond the scope of this course. Readers interested in applications for world-wide audiences are encouraged to explore various online descriptions and tutorials for both wchar_t and Unicode.
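For readers who want a first glimpse, wide-character literals are written with an L prefix, and wchar.h supplies wide versions of familiar string functions such as wcslen. The sketch below is only an illustration; the two function names are hypothetical wrappers, and the storage reported depends on the size of wchar_t on the local machine.

```c
#include <stddef.h>   /* size_t */
#include <wchar.h>    /* wchar_t, wcslen */

/* Return the number of wide characters in a wide string. */
size_t wide_length (const wchar_t *text)
{
  return wcslen (text);
}

/* Return the number of bytes used by the characters of a wide string,
   not counting the terminating null wide character. */
size_t wide_storage_bytes (const wchar_t *text)
{
  return wcslen (text) * sizeof (wchar_t);
}
```

For example, wide_length (L"abc") is 3, while wide_storage_bytes (L"abc") is 3 * sizeof (wchar_t), which would be 12 on a machine where wchar_t occupies 4 bytes.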

Multibyte Character Storage

Both char and wchar_t types represent fixed-sized storage for characters, providing two important properties:

When encountering a variable declaration:
```
char char1;
wchar_t char2;
```
the compiler is able to allocate the proper amount of space.
Given an array of type char or type wchar_t, finding the i^th item in the array is efficient. Start at the beginning of the array and move over (i-1) items (of the given size) to find the i^th item. No need to search item by item to locate item i!

However, both types char and wchar_t have disadvantages as well:

Type char is inadequate to store sufficient encodings for a world-wide audience.
Type wchar_t requires substantial storage, even when only limited characters are needed.

To resolve such troubles, the 2011 draft C Standard supports a multibyte capability. In this encoding, character encodings have variable lengths. Some characters may require as little as 1 byte storage, while others may require up to 4 bytes (or even more). With variable lengths possible, the range of allowed characters can be extensive — sufficient for identified world-wide needs for an application. Also, not all characters require 32 (or more) bits, possibly limiting storage requirements.

Complementing the notion of variable-length encodings, the 2011 draft C Standard also describes the concept of a locale, with which a local computing environment can define specific character sets for applications that run in that environment.

As with the topics of Unicode and wchar_t, details of multi-byte character encodings within C are beyond the scope of this course. Interested readers might explore a widely used coding system, called UTF-8, a variable-length encoding system which builds upon Unicode. (UTF-8 is an abbreviation for "Universal Coded Character Set + Transformation Format – 8-bit". Explore online references to unpack the jargon!)
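As one small taste of how a variable-length encoding works, in UTF-8 the first byte of each character indicates how many bytes the character occupies. The sketch below uses a hypothetical function name and examines only the leading bit pattern; it does not perform the full validation that a production UTF-8 decoder would need.

```c
/* Given the first byte of a UTF-8 encoded character, return the number
   of bytes in that character's encoding (1 to 4), or 0 if the byte
   does not have the bit pattern of a first byte. */
int utf8_length (unsigned char first)
{
  if ((first & 0x80) == 0x00) return 1;   /* pattern 0xxxxxxx */
  if ((first & 0xE0) == 0xC0) return 2;   /* pattern 110xxxxx */
  if ((first & 0xF0) == 0xE0) return 3;   /* pattern 1110xxxx */
  if ((first & 0xF8) == 0xF0) return 4;   /* pattern 11110xxx */
  return 0;   /* pattern 10xxxxxx (a continuation byte) or other */
}
```

Characters whose codes are below 128 occupy one byte, so text that uses only such characters takes no more space in UTF-8 than it would in an array of char.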

Sample program using type `char`

Most basic characters in a C program can be referenced by placing the character in single quotes:

```
char ch1 = 'a';
char ch2 = '?';
char ch3 = '7';
```

However, three characters have special meanings in C programs:

a single quote ' is used to identify characters (as above).
a double quote " is used to identify a sequence of characters (e.g., in a printf statement).
a backslash \ is used as a special delimiter (e.g., as described above, a horizontal tab is \t and a vertical tab is \v).

To avoid confusion with other uses within a C program, when we intend to specify these three characters, we precede each with a backslash, as illustrated below:

```
char ch4 = '\\';    /* assign the backslash character to variable ch4 */
```
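All three special characters can be written the same way. As a minimal sketch, the hypothetical function below counts how many single quotes, double quotes, and backslashes appear in a string, writing each of the three with a preceding backslash.

```c
/* Count the single quotes, double quotes, and backslashes in text. */
int count_special (const char *text)
{
  int count = 0;
  int i;
  for (i = 0; text[i] != '\0'; i++)
    {
      char ch = text[i];
      /* each special character is preceded by a backslash */
      if (ch == '\'' || ch == '\"' || ch == '\\')
        count++;
    }
  return count;
}
```

For example, the string literal "a\\b'c\"d" contains the seven characters a, backslash, b, single quote, c, double quote, and d, so count_special returns 3 for it.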

The following C program character-example.c builds upon these declarations to illustrate how variables of type char might be used in practice.

```
/* program to illustrate the use of the char data type
 */

#include <stdio.h>
#include <ctype.h>  /* draw upon several useful char functions */

int main ()
{
```

The ctype library provides several functions that can be helpful in working with char data.

```
  /* variable declarations */
  char ch1 = 'a';
  char ch2 = '?';
  char ch3 = '7';
  char ch4 = '\\';
  char ch5, ch6, ch7, ch8;
```

As with numeric variables, character variables may be initialized when declared, or variables may be declared early in a program and assigned values later.

```
  /* character encoding is required to increase by 1
     for each digit 0, 1, 2, ..., 9
  */
  ch5 = ch3 + 1;   // char is a small int, so addition possible
  ch6 = ch5 - 4;   // subtraction also possible
```

Since C requires the coding for digits to increase by one for each digit, the encoding for '7' plus 1 must give the encoding for '8', which is assigned to ch5.

```
  /* compare two characters by comparing encodings */
  if (ch6 == '4')
    printf ("ch6 is digit '4'\n");
  else
    printf ("ch6 is NOT digit '4'\n");
```

Following the arithmetic for the values of ch3, ch5, and ch6, the variable ch6 should contain the encoding for character '4'. Since char is a type of int, the equality operation == can be used to compare char values.

```
  /* utilize ctype library */
  /* determine if ch1 is a digit 0, ..., 9 */
  if (isdigit (ch1))
    printf ("%c is a digit\n", ch1);
  else
    printf ("%c is not a digit\n", ch1);
```

The ctype library provides several tests to determine the type of character represented in a variable. Each function returns a nonzero value (true) if the character fits within the prescribed category and 0 (false) otherwise.

function	category tested
isalpha	character is an alphabetic letter (either uppercase or lowercase)
isdigit	character is a decimal digit
isalnum	character is either alphabetic or a decimal digit
islower	character is a lowercase letter
isupper	character is an uppercase letter
isxdigit	character is a hexadecimal digit (`0`-`9`, `a`-`f`, or `A`-`F`)
isspace	character is space, `\f, \n, \r, \t` or `\v`
isprint	character is a printable character, including a space
ispunct	character is a printable character, except a space, letter, or digit
isgraph	character is a printable character, except a space
iscntrl	character is a control character (e.g., `\a, \b, \f`)

```
  /* convert lower case letter to upper case,
     other characters not changed */
  ch7 = toupper (ch1);
  ch8 = toupper (ch2);
```

The ctype library contains two functions to selectively transform letters. In each case, letters within a category are changed, but all other characters remain the same.

function	description
tolower	returns the lowercase form of an uppercase letter; returns every other character unchanged
toupper	returns the uppercase form of a lowercase letter; returns every other character unchanged

```
  /* print characters and their codes */
  printf ("characters and their codes\n");
  printf ("character\tcode\n");
  printf ("   %c \t\t %d\n", ch1, ch1);
  printf ("   %c \t\t %d\n", ch2, ch2);
  printf ("   %c \t\t %d\n", ch3, ch3);
  printf ("   %c \t\t %d\n", ch4, ch4);
  printf ("   %c \t\t %d\n", ch5, ch5);
  printf ("   %c \t\t %d\n", ch6, ch6);
  printf ("   %c \t\t %d\n", ch7, ch7);
  printf ("   %c \t\t %d\n", ch8, ch8);

  return 0;
}
```

Since a char is a type of small int, printing can utilize either of two formats:

%c format within a printf statement prints the character represented by a variable.
%d format within a printf statement prints the integer encoding of the character.
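The same ctype functions extend naturally from single characters to strings. The sketch below uses hypothetical function names. It also casts each character to unsigned char before calling a ctype function; the ctype functions expect a value in the range of unsigned char (or EOF), so the cast is a common precaution on machines where plain char is signed, although it makes no difference for the basic character set.

```c
#include <ctype.h>   /* isdigit, toupper */

/* Count the decimal digits in a string. */
int count_digits (const char *text)
{
  int count = 0;
  int i;
  for (i = 0; text[i] != '\0'; i++)
    if (isdigit ((unsigned char) text[i]))   /* nonzero means true */
      count++;
  return count;
}

/* Replace each lowercase letter in text by its uppercase form;
   all other characters are left unchanged. */
void make_uppercase (char *text)
{
  int i;
  for (i = 0; text[i] != '\0'; i++)
    text[i] = (char) toupper ((unsigned char) text[i]);
}
```

For example, count_digits ("a1b22") returns 3, and applying make_uppercase to an array holding "ab?7" changes its contents to "AB?7". Note that make_uppercase modifies its argument, so it must be given a modifiable array rather than a string literal.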

created 25 May 2016 by Henry M. Walker
expanded and edited 27 May 2016 by Henry M. Walker

For more information, please contact Henry M. Walker at walker@cs.grinnell.edu.

CSC 115.005/006	Sonoma State University	Spring 2022
	CSC 115.005/006: Programming I
Instructor: Henry M. Walker Lecturer, Sonoma State University Professor Emeritus of Computer Science and Mathematics, Grinnell College
