Strings in C
Single characters arise naturally in some common applications.
- Grades for a paper, assignment, or test might be 'A', 'B', etc.
- Various vitamins have 1-letter designations: vitamin 'A', vitamin 'C', vitamin 'D', etc.
However, many [most?] applications involve sequences of characters: words, phrases, sentences, etc. Further, even the applications listed above sometimes use two- or three-character codes: grade "A-" or vitamin "B12". Often a sequence of characters is called a string.
As with many elements of C, the storage of strings is closely tied to the processing of string data. This reading examines both string storage and processing in several sections.
Notation in C
C uses different notation for characters and strings:
- Single quotes ' are used to identify character data: 'A', 'B', '-', etc.
- Double quotes " are used to designate string data: "A-", "B12", etc.
For example, in C, 'A' represents a character, whereas "A" represents a string containing one letter. As this reading discusses, C stores these two entities differently.
String storage
Conceptually, a string may be considered as a sequence of characters. In C, this concept is implemented as an array of type char. For example, consider the literal string "Henry M. Walker". Behind the scenes, this string is stored in a char array. 'H' is the character with index 0; 'e' is the character with index 1, etc. Thus, the following code prints the letters H, r, and . on separate lines:
    printf ("%c\n", "Henry M. Walker"[0]);
    printf ("%c\n", "Henry M. Walker"[3]);
    printf ("%c\n", "Henry M. Walker"[7]);
Within C, a literal string is enclosed in double quotes and should be treated as unchangeable. The string "Henry M. Walker" is typically stored in a section of main memory reserved for constant data. Thus, this string can be accessed, characters within the string can be referenced, and the string can be copied. However, the string itself cannot be safely modified: the C standard leaves the result of any attempt to change a string literal undefined.
Since much text processing involves editing, a mechanism is needed to store sequences of characters in variables, for which data can be adjusted. Sometimes editing will change the number of characters in the sequence.
Some important capabilities for editing are illustrated in the following table.
starting string | edited string | notes |
---|---|---|
Henry M. Walker | HENRY M. WALKER | change some characters |
Henry M. Walker | Henry Walker | remove characters |
Henry M. Walker | Henry MacKay Walker | add characters |
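As a rough illustration of the "remove characters" row, the following sketch shifts the tail of a string left over the characters being removed. The helper name remove_chars is hypothetical (it is not a library function), and the sketch assumes the string lives in a writable char array and that start + count does not exceed the string's length. It relies on the standard strlen and memmove functions from string.h, which are discussed later in this reading.

```c
#include <string.h>

/* Hypothetical helper (not part of the C library): removes count
   characters from text, starting at index start.
   Assumes start + count does not exceed the length of the string. */
void remove_chars(char text[], size_t start, size_t count) {
    size_t length = strlen(text);
    /* shift the tail, including its terminating 0, left over the
       removed characters; memmove allows overlapping regions */
    memmove(text + start, text + start + count, length - start - count + 1);
}
```

For example, with "Henry M. Walker" stored in a char array, removing 3 characters starting at index 6 would leave "Henry Walker".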
Allocated space
Although flexibility may be needed for text editing, C requires a compiler to allocate space when each variable is declared. To accommodate both flexibility and space allocation, string storage for variables typically involves three basic actions:
- Declare a variable of sufficient size to handle both the initial string and anticipated editing.
- Use the variable name to keep track of the start of each string.
- Use a special character, called NULL (the null character, also written '\0' in C), to mark the actual end of the string.
In calculating array storage for strings, note that every array must contain space for the desired characters plus space for the NULL at the end. Thus, in working with the string "Computing", a program must allocate at least 10 char locations — 9 for the letters within "Computing" and one for the final NULL.
Previously, the discussion of arrays in C indicated that array variables always represent the base address or starting point of an array. Given only that base address (for example, when an array is passed to a function), however, there is no way to determine the array's size. By declaring a char array, we know the starting point for a sequence of characters, but we do not know either the logical end of the character sequence or the size of the array. The NULL provides a mechanism to determine both the end and the number of characters.
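To make the role of the terminator concrete, here is a minimal sketch of how the number of characters can be found from the base address alone. The name count_chars is hypothetical; the standard library's strlen function (listed later in this reading) does the same job and should be preferred in real programs.

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): counts the characters
   that come before the terminating null character. */
size_t count_chars(const char text[]) {
    size_t n = 0;
    while (text[n] != 0)   /* stop at the terminator, which is not counted */
        n++;
    return n;
}
```

Called with "Computing", this sketch would return 9, even though the string occupies 10 char locations.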
According to the 2011 draft C standard, the NULL character must be encoded as the integer 0. Thus, the following two assignment statements are equivalent.
    char ch;
    ch = NULL;  /* may generate a compiler warning, since NULL is often defined as a pointer constant */
    ch = 0;
In supporting arrays, many modern languages store several descriptive elements, including both the starting point and the size. Although access to this information may be convenient, storing multiple descriptive pieces of data takes space. C chooses to save space and utilize other mechanisms (e.g., the NULL character) for possible descriptive data.
Example 1
The following program string-example-1.c illustrates the declaration, initialization, and simple processing for strings.
    /* Program begins with the string "Cs"
       saves it in a relatively large char array,
       converts all letters to upper case
       edits the string to yield "CS FOR ALL"
       Throughout, printing is accomplished with %s format
    */
Example 1 commentary
With the program specification given, the actual output of this code is:
    original string: Cs
    capitalized string: CS
    final string: CS FOR ALL
    #include <stdio.h>
    #include <ctype.h>

    int main () {
Both the stdio.h and ctype.h libraries are used.
      /* save "Cs" string in 14-character array */
      char text [14] = "Cs";
      printf ("original string: %s\n", text);
[Figure: string initialization]
Declaration and Initialization
As with arrays of any type, the declaration of a char array indicates the type (char), the variable name (e.g., text), and the array size (e.g., 14).
A string array need not be initialized, in which case the characters stored in the array can be anything.
A string array can be initialized as part of the declaration.
- The literal initializing string (e.g., "Cs"), together with the NULL character at the end, must fit within the space allocated.
- If the initializing string is shorter than the space declared, the rest of the array is filled with NULL characters.
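The padding rule can be checked with a short sketch. The function name count_null_padding is hypothetical and exists only for illustration; it declares the same 14-element array used in Example 1 and counts how many elements hold the value 0.

```c
#include <stddef.h>

/* Hypothetical helper: counts the zero-valued elements in a freshly
   initialized array, to illustrate the padding rule. */
size_t count_null_padding(void) {
    char text[14] = "Cs";   /* 'C', 's', then null characters */
    size_t count = 0;
    size_t i;
    for (i = 0; i < sizeof text; i++) {
        if (text[i] == 0)
            count++;
    }
    return count;
}
```

Since "Cs" uses 2 of the 14 elements, the sketch should report 12 null characters: the terminator plus 11 elements of padding.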
      /* convert all letters to upper case */
      int i;
      for (i = 0; i < 14; i++)
        text[i] = toupper (text[i]);
      printf ("capitalized string: %s\n", text);
Since each element within a char array is a character, all functions in ctype.h apply to these array elements.
printf in the stdio.h library uses %s format to print strings for char arrays, starting at the given array base address and continuing through the NULL character.
      /* add " FOR ALL" to string */
      for (i = 0; i < 8; i++)
        text[i+2] = " FOR ALL"[i];
One approach to edit "CS" to yield "CS FOR ALL" is to copy the characters from the string " FOR ALL", character by character. In this example,
- The expression " FOR ALL" is considered as an array of characters, so " FOR ALL"[i] accesses the character at index i of the string " FOR ALL".
- With this access, each character of " FOR ALL" is placed 2 locations later in the variable array text, so the letters "CS" are not changed. (Later parts of this reading describe standard C functions to accomplish the copying of letters and many string-processing tasks.)
      /* place NULL character at end */
      text[10] = 0;
      printf ("final string: %s\n", text);
      return 0;
    }
Every string must conclude with a NULL character. This code explicitly marks the end of the string. Another approach would be to copy the NULL at the end of " FOR ALL" by extending the previous for loop to use the condition i < 9.
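The alternative just described might look like the following sketch. The wrapper name append_for_all is hypothetical, and the code assumes that text already holds a 2-character string (such as "CS") in an array with room for at least 11 characters.

```c
/* Hypothetical helper wrapping the copy loop from Example 1.
   Assumes text holds a 2-character string and has room for at least
   11 characters in total. */
void append_for_all(char text[]) {
    int i;
    /* " FOR ALL" has 8 visible characters plus a terminator at index 8,
       so the condition i < 9 copies the terminator as well */
    for (i = 0; i < 9; i++)
        text[i + 2] = " FOR ALL"[i];
}
```

With this version, no separate assignment to text[10] is needed, because the last pass through the loop stores the terminator there.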
char arrays and char * variables
Consider the declaration
char text [14];
With this declaration,
- the computer allocates space for 14 char values
- the variable text records the base address or starting point of the array.
Expanding this second point, text specifies the address of the first array element; in Example 1 above, text records the address of the character 'C' in the string "Cs". From our work on function parameters, the address of a character could be specified as having type char *.
Putting these pieces together, the declaration char text [14] yields three results:
- space is allocated (for 14 char values)
- variable text can be considered to have type char *, and
- the address stored in text is permanently designated as the address of the first character allocated.
Next, consider the declaration
    char * place;
With this code, place is identified as the location of some character, but no space for the character has actually been allocated. We cannot use place directly in a program, until the address stored refers to space allocated separately. The following code segment illustrates both some limitations and some capabilities possible with the char * type.
    char text1 [8] = "abcde";
    char text2 [6] = "wxyz";
    char * place = text1;
    printf ("first string: %s\n", place);
    place = text2;
    printf ("second string: %s\n", place);
In this example, the characters "abcde" and "wxyz" are stored in arrays text1 and text2, respectively, and a NULL character is added to the end of each string.
When place is declared and initialized, it refers to the base address for the text1 array, so the first printf statement prints abcde. Later, the assignment place = text2 causes the place variable to refer to the base address for the text2 array, so the second printf statement prints wxyz.
[Figure: char * example]
To extend this example, we add the following lines after the second printf statement:
    place[2] = '7';
    printf ("final string: %s\n", place);
    printf ("final text1: %s\n", text1);
    printf ("final text2: %s\n", text2);
When this code is executed, place has been assigned the text2 base address. Thus, place[2] references the 'y' character in the text2 array, and the assignment changes 'y' to '7'. Note there is only one copy of the text2 array, and both text2 and place access it.
Moving to the printf statements, the text1 array remains unchanged, so the printing for that line yields abcde. However, the text2 string has changed to "wx7z". Since both text2 and place access this storage, the printing of both of these variables yields the same wx7z.
With the statement place = text2, we say that place is an alias for the variable text2. After the assignment, both of these refer to the same locations in memory, so a change in either yields a change for both.
Potential array overflow
In an earlier module, we discovered that if arr is an array, then the computer does not check whether the expression arr[i] references a location within the array. Although this issue can arise with any array, the potential for trouble can be high when working with strings. As a simple example, we return to the first example in this reading.
    char text [14] = "CS";
    /* add " FOR ALL" to string, including the final NULL */
    for (i = 0; i < 9; i++)
      text[i+2] = " FOR ALL"[i];
In this example, all copied characters fit within the first 11 elements of the array. However, consider this variation:
    char text [14] = "Computing";
    /* add " FOR ALL" to string, including the final NULL */
    for (i = 0; i < 9; i++)
      text[i+9] = " FOR ALL"[i];
In this code, "Computing" has 9 characters (excluding the final NULL), and " FOR ALL" has 8 characters (excluding the NULL). Each of these strings fits nicely within the 14 characters allocated for text. However, the desired combined string will require 9 + 8 + 1 = 18 characters (including the NULL). Thus, the loop will be storing character data beyond the end of the text array. We can only speculate what might be stored in these additional locations, but we know something else will be changed: perhaps the variable i, perhaps another variable, perhaps another part of the program, or perhaps nothing of interest.
As a second simple example, consider the following code segment:
    char str [4] = "one";
    str[3] = 's';
    printf ("str: %s\n", str);
In this code, all characters are stored within the array. However, printing the str array starts with the 'o' in array position 0, but then printing continues until a NULL — wherever that might be! In practice, string processing will cause array accesses beyond the allocated space, with unknown consequences!
[Figure: string overflow]
Both of these examples illustrate that string processing has the potential to change memory beyond the intended strings. Throughout processing, therefore, there is a need to check that array accesses stay within the space actually allocated for string arrays!
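One way to perform such a check is sketched below. The helper name append_if_room and its return convention are assumptions made for illustration, not part of the C library. The sketch assumes dest already contains a properly terminated string and that dest_size is the number of elements actually allocated for dest.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src to the string in dest only when the
   result, including the terminating null character, fits in dest_size
   elements.  Returns 1 on success and 0 (leaving dest unchanged) otherwise. */
int append_if_room(char dest[], size_t dest_size, const char *src) {
    size_t dest_len = strlen(dest);
    size_t src_len = strlen(src);
    size_t i;

    if (dest_len + src_len + 1 > dest_size)
        return 0;                     /* would overflow: refuse to copy */

    for (i = 0; i <= src_len; i++)    /* <= copies the terminator too */
        dest[dest_len + i] = src[i];
    return 1;
}
```

With a 14-element array holding "Computing", a call that tries to append " FOR ALL" would return 0 and leave the array untouched, since 18 characters are needed. The same call with an 18-element array would succeed.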
Security Warning
Several security and privacy problems with software arise when array references extend beyond allocated space. If an outsider can place data at locations throughout main memory, then the outsider might be able to change the behavior of a program or might be able to access private information!
String functions in C
C contains numerous library functions that support string processing. Documentation for strings and library functions is widely available. Two basic sources are particularly helpful:
- Linux documents many capabilities through a terminal-based manual, often called man pages.
  - Within a terminal window, type
    man string
    to obtain header information and brief descriptions of many string-related functions.
  - For details on any specific function, type
    man [function name]
    For example, for information on the strlen function, type
    man strlen
- The Linux man-pages project provides online documentation for both the Linux environment and the standard C libraries. Some common links follow.
  - a listing of string-related functions from the strings.h and string.h libraries.
The following table identifies commonly-used functions for char arrays. (Check man pages for additional functions!)
Category | Functions for NULL-terminated strings | Functions limiting processing length | Functions for char arrays, ignoring NULLs |
---|---|---|---|
General: determine length of string | strlen | | |
General: initialize block of memory | | | memset |
String/character copying | strcpy | strncpy | memcpy, memmove |
String concatenation | strcat | strncat | |
String comparison | strcmp | strncmp | memcmp |
Search for character; return location | strchr, strrchr, index, rindex, strpbrk | | memchr |
Search for character; return index | strspn, strcspn | | |
Search for substring | strstr | | |
Break string into pieces | strtok | | |
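As a small illustration of a function that returns a location, the sketch below uses strchr, which yields a pointer to the first matching character (or a null pointer when there is no match). The wrapper name index_of is hypothetical; it converts the returned location into an array index by subtracting the base address.

```c
#include <string.h>

/* Hypothetical helper: returns the index of the first occurrence of ch
   in text, or -1 if ch does not appear. */
long index_of(const char *text, char ch) {
    const char *location = strchr(text, ch);   /* pointer to the match, or a null pointer */
    if (location == NULL)
        return -1;
    return (long) (location - text);           /* pointer difference gives the index */
}
```

For the string "Henry M. Walker", searching for 'M' would give index 6, matching the indexing described earlier in this reading.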
Function Notes
The description and use of various string functions build upon many of the ideas discussed throughout this reading!
- Many functions utilize parameters of type char *.
  - When the functions are called (e.g., in main), the supplied parameters will specify arrays with space already allocated.
  - The char * parameters serve as references to the actual parameters.
- Function parameters often serve in three different roles.
  - A source or src parameter provides data to be used within the function. Often this parameter is not changed, so this parameter is declared:
    const char * src
  - A destination or dest parameter refers to the array that will be changed. This parameter often is declared:
    char * dest
  - Some functions limit processing to a specified number of characters, in which case the parameter often is declared:
    size_t n
    In this context, think of size_t as a type of non-negative integer.
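The three parameter roles can be seen together in a hand-written copy function. The function copy_limited below is hypothetical and written only for illustration; it is not a library function. Unlike the standard strncpy, which does not add a terminator when the source has n or more characters, this sketch copies at most n - 1 characters and always terminates the destination.

```c
#include <stddef.h>

/* Hypothetical function, written only to show the three parameter roles;
   it is not part of the C library.  Copies at most n - 1 characters from
   src to dest and always terminates dest (when n > 0). */
char *copy_limited(char *dest, const char *src, size_t n) {
    size_t i;
    if (n == 0)
        return dest;               /* no room for even a terminator */
    for (i = 0; i + 1 < n && src[i] != 0; i++)
        dest[i] = src[i];          /* src is only read; dest is changed */
    dest[i] = 0;                   /* mark the end of the copied string */
    return dest;
}
```

A typical call passes the allocated size of the destination as n, for example copy_limited (buffer, "Walker", sizeof buffer) for a char array named buffer.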
Example 2
The following program string-example-2.c illustrates several elements of string processing, including the use of char arrays and string functions.
    /* program to compile information about people in one family */
    #include <string.h>
    #include <stdio.h>

    int main () {
Program notes
- Many string functions are defined in C's string.h library.
      /* initialize two given names, as character arrays with NULL at end */
      char given_name1 [10] = {'H', 'e', 'n', 'r', 'y', 0};
      char given_name2 [10] = {'T', 'h', 'e', 'r', 'e', 's', 'a', 0};
- Any array may be initialized by listing the array elements one-by-one.
- When char arrays will be interpreted as strings, a 0 (or NULL) is needed at the end of each string.
      /* initialize two more given names as strings */
      char given_name3 [10] = "Donna";
      char given_name4 [10] = "Barbara";

      /* initialize common last name in family */
      char surname [10] = "Walker";
- Initialization of string arrays also may specify the string.
- Remember:
  - String data are enclosed in double quotes "
  - Character data are enclosed in single quotes '
      /* add space before surname */
      char space_surname [20] = " ";
      strcat (space_surname, surname);
- After initializing space_surname to a space, strcat concatenates the surname string. Overall, space_surname contains a space, followed by the original surname.
      /* compute full names */
      char full_name1 [20];
      char full_name2 [20];
      char full_name3 [20];
      char full_name4 [20];
- We allocate sufficient space to store the combined given name and surname.
      /* copy given names */
      strcpy (full_name1, given_name1);
      strcpy (full_name2, given_name2);
      strcpy (full_name3, given_name3);
      strcpy (full_name4, given_name4);
- The strcpy function copies the second argument (e.g., given_name1) to the start of the first argument (e.g., full_name1).
- Most string functions place the destination string (dest) first and the source (src) second. Copying thus goes from the second argument to the first.
      /* combine given and surname (with space) to obtain full name */
      strcat (full_name1, space_surname);
      strcat (full_name2, space_surname);
      strcat (full_name3, space_surname);
      strcat (full_name4, space_surname);
- Whereas strcpy copies to the beginning of the destination array, strcat adds the second string after the first, overwriting the NULL character originally at the end of the first string.
      /* print full names in family */
      printf ("People in this family\n");
      printf (" %s\n", full_name1);
      printf (" %s\n", full_name2);
      printf (" %s\n", full_name3);
      printf (" %s\n", full_name4);
- printf prints strings using the %s format. Printing continues from the start of the char array until a NULL character is found.
      /* determine which given_name comes first in alphabetical order
         check name by name; before indicates earliest name found during processing */
      char * before = given_name1;
      if (strcmp (before, given_name2) > 0)
        before = given_name2;
      if (strcmp (before, given_name3) > 0)
        before = given_name3;
      if (strcmp (before, given_name4) > 0)
        before = given_name4;
      printf ("Given name coming first in alphabetical order: %s\n", before);
      /* strlen returns a size_t value, which printf prints with %zu */
      printf ("Number of characters in first alphabetical name: %zu\n", strlen (before));
      return 0;
    }
- The strcmp function compares two strings, following dictionary order.
  - If the first string comes before the second in alphabetical order, then strcmp returns a negative value.
  - If the first string comes after the second in alphabetical order, then strcmp returns a positive value.
  - If the two strings are the same, then strcmp returns zero.
- Dictionary order compares characters in strings one by one, until a difference is found.
  - In determining dictionary order, the first characters of the two strings are compared. If they differ, then the underlying character encoding specifies which string comes first.
  - If the first characters of the two strings are the same, strcmp examines the second characters of these strings. If they differ, then the underlying character encoding specifies which string comes first.
  - Etc.
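The character-by-character comparison described above can be sketched as a short function. The name compare_strings is hypothetical; the library function strcmp should be preferred in real programs. As with strcmp, only the sign of the result is meant to be meaningful.

```c
#include <stddef.h>

/* Hypothetical helper that mirrors the dictionary-order description;
   it is not a replacement for the library function strcmp. */
int compare_strings(const char *first, const char *second) {
    size_t i = 0;
    /* advance while the characters match and the first string has not ended */
    while (first[i] != 0 && first[i] == second[i])
        i++;
    /* compare the first differing characters by their encoding;
       a string that ends early compares as smaller */
    return (unsigned char) first[i] - (unsigned char) second[i];
}
```

For the names in Example 2, comparing "Barbara" with "Donna" would give a negative value, while comparing "Henry" with "Donna" would give a positive one. A string that is a prefix of another, such as "Don" and "Donna", comes first.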
created 25 May 2016 by Henry M. Walker
expanded and edited 27 May 2016 by Henry M. Walker
For more information, please contact Henry M. Walker at walker@cs.grinnell.edu.