We have seen two basic elements in an expression:
A value expressed as a literal or a variable.
An operator.
A regular expression is made up of these same elements. Any character, except the metacharacters in Table 3.1, is interpreted as a literal that matches only itself.
Special Characters | Usage |
---|---|
. | Matches any single character except newline. In awk, dot can match newline also. |
* | Matches any number (including zero) of the single character (including a character specified by a regular expression) that immediately precedes it. |
[...] | Matches any one of the class of characters enclosed between the brackets. A circumflex (^) as first character inside brackets reverses the match to all characters except newline and those listed in the class. In awk, newline will also match. A hyphen (-) is used to indicate a range of characters. The close bracket (]) as the first character in class is a member of the class. All other metacharacters lose their meaning when specified as members of a class. |
^ | First character of regular expression, matches the beginning of the line. Matches the beginning of a string in awk, even if the string contains embedded newlines. |
$ | As last character of regular expression, matches the end of the line. Matches the end of a string in awk, even if the string contains embedded newlines. |
\{n,m\} | Matches a range of occurrences of the single character (including a character specified by a regular expression) that immediately precedes it. \{n\} will match exactly n occurrences, \{n,\} will match at least n occurrences, and \{n,m\} will match any number of occurrences between n and m. (sed and grep only, may not be in some very old versions.) |
\ | Escapes the special character that follows. |
[2] Most awk implementations do not yet support this notation.
Metacharacters have a special meaning in regular expressions, much the same way as + and * have special meaning in arithmetic expressions. Several of the metacharacters (+ ? () |) are available only as part of the extended set used by programs such as egrep and awk. We will look at what each metacharacter does in upcoming sections, beginning with the backslash.
The backslash (\) metacharacter transforms metacharacters into ordinary characters (and ordinary characters into metacharacters). It forces the literal interpretation of any metacharacter such that it will match itself. For instance, the dot (.) is a metacharacter that needs to be escaped with a backslash if you want to match a period. This regular expression matches a period followed by three spaces.
\.
The backslash is typically used to match troff requests or macros that begin with a dot.
\.nf
You can also use the backslash to escape the backslash. For instance, the font change request in troff is "\f". To search for lines containing this request, you'd use the following regular expression:
\\f
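The effect of escaping can be checked on a few made-up lines. The following is only a sketch: the sample text and the `sample` helper are hypothetical, and a POSIX-style grep is assumed.

```shell
# Hypothetical sample text; the quoted here-document keeps backslashes literal.
sample() {
cat <<'EOF'
.nf
confirm the settings
a \fB bold \fP word
plain text
EOF
}

# An unescaped dot is a wildcard, so ".nf" also matches "onf" in "confirm".
unescaped=$(sample | grep -c '.nf')
# An escaped dot matches only a literal period.
escaped=$(sample | grep -c '\.nf')
# A doubled backslash matches one literal backslash followed by "f".
font=$(sample | grep -c '\\f')

echo "unescaped=$unescaped escaped=$escaped font=$font"
```

The single quotes keep the shell from interpreting the backslashes before grep sees them.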
In addition, sed uses the backslash to cause a group of ordinary characters to be interpreted as metacharacters, as shown in Figure 3.2.
The n in the "\n" construct represents a digit from 1 to 9; its use will be explained in Chapter 5, Basic sed Commands.
The wildcard metacharacter, or dot (.), might be considered equivalent to a variable. A variable represents any value in an arithmetic expression. In a regular expression, a dot (.) is a wildcard that represents any character except the newline. (In awk, dot can even match an embedded newline character.)
Given that we are describing a sequence of characters, the wildcard metacharacter allows you to specify a position that any character can fill.
For instance, if we were searching a file containing a discussion of the Intel family of microprocessors, the following regular expression:
80.86
would match lines containing references to "80286," "80386," or "80486."[3] To match a decimal point or a period, you must escape the dot with a backslash.
[3] The Pentium family of microprocessors breaks our simple pattern-matching experiment, spoiling the fun. Not to mention the original 8086.
It is seldom useful to match just any character at the beginning or end of a pattern. Therefore, the wildcard character is usually preceded and followed by a literal character or other metacharacter. For example, the following regular expression might be written to search for references to chapters:
Chapter.
It searches for "the string `Chapter' followed by any character." In a search, this expression would turn up virtually the same matches as the fixed string pattern "Chapter". Look at the following example:
$ grep 'Chapter.' sample
you will find several examples in Chapter 9.
"Quote me 'Chapter and Verse'," she said.
Chapter Ten
Searching for the string "Chapter" as opposed to "Chapter." would have matched all of the same lines. However, there is one case that would be different - if "Chapter" appeared at the end of a line. The wildcard does not match the newline, so "Chapter." would not match that line, while the fixed-string pattern would match the line.
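The end-of-line case is easy to reproduce. This sketch uses a hypothetical three-line sample (not the file searched above) whose second line ends with "Chapter", and compares the two patterns by counting matching lines.

```shell
# Hypothetical sample; the second line ends with the word "Chapter".
sample() {
cat <<'EOF'
you will find several examples in Chapter 9.
turn to the next Chapter
Chapter Ten
EOF
}

# The wildcard needs a character after "Chapter", so line 2 is skipped.
with_dot=$(sample | grep -c 'Chapter.')
# The fixed string matches all three lines.
fixed=$(sample | grep -c 'Chapter')

echo "Chapter. matches $with_dot lines; Chapter matches $fixed lines"
```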
For all practical purposes, you can rely on a program to produce the correct result. However, that doesn't mean the program always works correctly as far as you are concerned. Most of the time, you can bet that if a program does not produce the output that you expected, the real problem (putting aside input or syntax errors) is how you described what you wanted.
In other words, the place to look to correct the problem is the expression where you described the result you wanted. Either the expression is incomplete or it is improperly formulated. For instance, if a program evaluates this expression:
PAY = WEEKLY_SALARY * 52
and knows the values of these variables, it will calculate the correct result. But someone might object that the formula did not account for salespeople, who also receive a commission. To describe this instance, the expression would need to be reformulated as:
PAY = WEEKLY_SALARY * 52 + COMMISSION
You could say that whoever wrote the first expression did not fully understand the scope of the problem and thus did not describe it well. It is important to know just how detailed a description must be. If you ask someone to bring you a book, and there are multiple books in view, you need to describe more specifically the book that you want (or be content with an indeterminate selection process).
The same is true with regular expressions. A program such as grep is simple and easy to use. Understanding the elements of regular expressions is not so hard, either. Regular expressions allow you to write simple or complex descriptions of patterns. However, what makes writing regular expressions difficult (and interesting) is the complexity of the application: the variety of occurrences or contexts in which a pattern appears. This complexity is inherent in language itself, just as you can't always understand an expression by looking up each word in the dictionary.
The process of writing a regular expression involves three steps:
Knowing what it is you want to match and how it might appear in the text.
Writing a pattern to describe what you want to match.
Testing the pattern to see what it matches.
This process is virtually the same kind of process that a programmer follows to develop a program. Step 1 might be considered the specification, which should reflect an understanding of the problem to be solved as well as how to solve it. Step 2 is analogous to the actual coding of the program, and Step 3 involves running the program and testing it against the specification. Steps 2 and 3 form a loop that is repeated until the program works satisfactorily.
Testing your description of what you want to match ensures that the description works as expected. It usually uncovers a few surprises. Carefully examining the results of a test, comparing the output against the input, will greatly improve your understanding of regular expressions. You might consider evaluating the results of a pattern-matching operation as follows:
Hits: The lines that I wanted to match.
Misses: The lines that I didn't want to match.
Omissions: The lines that I didn't match but wanted to match.
False alarms: The lines that I matched but didn't want to match.
Trying to perfect your description of a pattern is something that you work at from opposite ends: you try to eliminate the false alarms by limiting the possible matches and you try to capture the omissions by expanding the possible matches.
The difficulty is especially apparent when you must describe patterns using fixed strings. Each character you remove from the fixed-string pattern increases the number of possible matches. For instance, while searching for the string "what," you determine that you'd like to match "What" as well. The only fixed-string pattern that will match "What" and "what" is "hat," the longest string common to both. It is obvious, though, that searching for "hat" will produce unwanted matches. Each character you add to a fixed-string pattern decreases the number of possible matches. The string "them" will usually produce fewer matches than the string "the."
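One way to make this kind of evaluation concrete is to tag each test line with the result you expect and let a small script compare that tag with what grep actually does. The sketch below is an illustration only: the tagging convention ("> " for wanted lines, "< " for unwanted ones), the sample text, and the `tagged` and `classify` helpers are all assumptions made for this example.

```shell
# Hypothetical tagged sample: "> " marks lines we want, "< " lines we do not.
tagged() {
cat <<'EOF'
> what is it
> What is it
< that hat
< nothing here
EOF
}

# classify PATTERN: label each line by comparing its tag with grep's verdict.
classify() {
  tagged | while IFS= read -r line; do
    tag=${line%% *}      # first word: ">" or "<"
    text=${line#??}      # the line without its two-character tag
    if printf '%s\n' "$text" | grep -q -- "$1"; then
      if [ "$tag" = ">" ]; then echo hit; else echo false_alarm; fi
    else
      if [ "$tag" = ">" ]; then echo omission; else echo miss; fi
    fi
  done | tr '\n' ' '
}

narrow=$(classify 'what')   # too specific: leaves out "What"
broad=$(classify 'hat')     # too general: also matches "that hat"
echo "what: $narrow"
echo "hat:  $broad"
```

With these four lines, "what" produces an omission and "hat" produces a false alarm, which is the trade-off described for fixed strings.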
Using metacharacters in patterns provides greater flexibility in extending or narrowing the range of matches. Metacharacters, used in combination with literals or other metacharacters, can be used to expand the range of matches while still eliminating the matches that you do not want.
A character class is a refinement of the wildcard concept. Instead of matching any character at a specific position, we can list the characters to be matched. The square bracket metacharacters ([]) enclose the list of characters, any of which can occupy a single position.
Character classes are useful for dealing with uppercase and lowercase letters, for instance. If "what" might appear with either an initial capital letter or a lowercase letter, you can specify:
[Ww]hat
This regular expression can match "what" or "What." It will match any line that contains this four-character string, the first character of which is either "W" or "w." Therefore, it could match "Whatever" or "somewhat."
If a file contained structured heading macros, such as .H1, .H2, .H3, etc., you could extract any of these lines with the regular expression:
\.H[12345]
This pattern matches a three-character string, where the last character is any number from 1 to 5.
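Both character classes can be tried on a handful of lines. The sample below is hypothetical and the counts apply only to it; a POSIX-style grep is assumed.

```shell
# Hypothetical sample lines for the two character classes.
sample() {
cat <<'EOF'
What now
somewhat
that
.H1 "Getting Started"
.H6 "Too Deep"
EOF
}

# "What now" and "somewhat" match; "that" has no W or w before "hat".
what_count=$(sample | grep -c '[Ww]hat')
# Only .H1 matches; 6 is not in the class [12345].
heading_count=$(sample | grep -c '\.H[12345]')

echo "what=$what_count headings=$heading_count"
```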
The same syntax is used by the UNIX shell. Thus, you can use character classes to specify filenames in UNIX commands. For example, to extract headings from a group of chapter files, you might enter:
$ grep '\.H[123]' ch0[12]
ch01:.H1 "Contents of Distribution Tape"
ch01:.H1 "Installing the Software"
ch01:.H1 "Configuring the System"
ch01:.H2 "Specifying Input Devices"
ch01:.H3 "Using the Touch Screen"
ch01:.H3 "Using the Mouse"
ch01:.H2 "Specifying Printers"
ch02:.H1 "Getting Started"
ch02:.H2 "A Quick Tour"
. . .
Note that you have to quote the pattern so that it is passed on to grep rather than interpreted by the shell. The output produced by grep identifies the name of the file for each line printed. As another example of a character class, assume you want to specify the different punctuation marks that end a sentence:
.[!?;:,".]  .
This expression matches "any character followed by an exclamation mark or question mark or semicolon or colon or comma or quotation mark or period and then followed by two spaces and any character." It could be used to find places where two spaces had been left between the end of a sentence and the beginning of the next sentence, when this occurs on one line. Notice that there are three dots in this expression. The first and last dots are wildcard metacharacters, but the second dot is interpreted literally. Inside square brackets, the standard metacharacters lose their meaning. Thus, the dot inside the square brackets indicates a period. Table 3.2 lists the characters that have a special meaning inside square brackets.
Character | Function |
---|---|
\ | Escapes any special character (awk only) |
- | Indicates a range when not in the first or last position. |
^ | Indicates a reverse match only when in the first position. |
The backslash is special only in awk, making it possible to write "[a\]1]" for a character class that will match an a, a right bracket, or a 1.
The hyphen character (-) allows you to specify a range of characters. For instance, the range of all uppercase English letters[4] can be specified as:
[4] This can actually be very messy when working in non-ASCII character sets and/or languages other than English. The POSIX standard addresses this issue; the new POSIX features are presented below.
[A-Z]
A range of single-digit numbers can be specified as:
[0-9]
This character class helps solve an earlier problem of matching chapter references. Look at the following regular expression:
[cC]hapter [1-9]
It matches the string "chapter" or "Chapter" followed by a space and then followed by any single-digit number from 1 to 9. Each of the following lines matches the pattern:
you will find the information in chapter 9
and chapter 12.
Chapter 4 contains a summary at the end.
Depending upon the task, the second line in this example might be considered a false alarm. You might add a space following "[1-9]" to avoid matching two-digit numbers. You could also specify a class of characters not to be matched at that position, as we'll see in the next section. Multiple ranges can be specified as well as intermixed with literal characters:
[0-9a-z?,.;:'"]
This expression will match "any single character that is numeric, lowercase alphabetic, or a question mark, comma, period, semicolon, colon, single quote, or quotation mark." Remember that each character class matches a single character. If you specify multiple classes, you are describing multiple consecutive characters such as:
[a-zA-Z][.?!]
This expression will match "any lowercase or uppercase letter followed by either a period, a question mark, or an exclamation mark."
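Returning to the chapter-reference pattern, the effect of adding a trailing space can be measured directly. This is a sketch on a three-line sample and assumes a POSIX-style grep; the `sample` helper is hypothetical.

```shell
# Three sample lines containing chapter references.
sample() {
cat <<'EOF'
you will find the information in chapter 9
and chapter 12.
Chapter 4 contains a summary at the end.
EOF
}

# Without the trailing space, "chapter 12." matches on its first digit.
loose=$(sample | grep -c '[cC]hapter [1-9]')
# With the trailing space, "chapter 12." is rejected, but so is
# "chapter 9" at the end of a line, where no space follows the digit.
strict=$(sample | grep -c '[cC]hapter [1-9] ')

echo "loose=$loose strict=$strict"
```

Removing one false alarm has introduced an omission, which is typical of tightening a pattern.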
The close bracket (]) is interpreted as a member of the class if it occurs as the first character in the class (or as the first character after a circumflex; see the next section). The hyphen loses its special meaning within a class if it is the first or last character. Therefore, to match arithmetic operators, we put the hyphen (-) first in the following example:
[-+*/]
In awk, you could also use the backslash to escape the hyphen or close bracket wherever either one occurs in the range, but the syntax is messier.
Trying to match dates with a regular expression is an interesting problem. Here are two possible formats:
MM-DD-YY
MM/DD/YY
The following regular expression indicates the possible range of values for each character position:
[0-1][0-9][-/][0-3][0-9][-/][0-9][0-9]
Either "-" or "/" could be the delimiter. Putting the hyphen in the first position ensures that it will be interpreted in a character class literally, as a hyphen, and not as indicating a range.[5]
[5] Note that the expression matches dates that mix their delimiters, as well as impossible dates like "15/32/78."
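A quick test shows both the matches and the limitations of the date pattern. The five dates below are invented for illustration, and the `sample` helper is hypothetical.

```shell
# Hypothetical dates: two valid, one with mixed delimiters,
# one impossible, and one with single-digit fields.
sample() {
cat <<'EOF'
12-25-99
07/04/76
12-25/99
15/32/78
1-2-99
EOF
}

# Only "1-2-99" is rejected; the pattern checks digit positions, not validity.
matched=$(sample | grep -c '[0-1][0-9][-/][0-3][0-9][-/][0-9][0-9]')

echo "matched $matched of 5 lines"
```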
Normally, a character class includes all the characters that you want to match at that position. The circumflex (^) as the first character in the class excludes all of the characters in the class from being matched. Instead any character except newline[6] that is not listed in the square brackets will be matched. The following pattern will match any non-numeric character:
[6] In awk, newline can also be matched.
[^0-9]
It matches all uppercase and lowercase letters of the alphabet and all special characters such as punctuation marks.
Excluding specific characters is sometimes more convenient than explicitly listing all the characters you want to match. For instance, if you wanted to match any consonant, you could simply exclude vowels:
[^aeiou]
This expression would match any consonant, any vowel in uppercase, and any punctuation mark or special character.
Look at the following regular expression:
\.DS "[^1]"
This expression matches the string ".DS" followed by a space, a quote followed by any character other than the number "1," followed by a quote.[7] It is designed to avoid matching the following line:
[7] When typing this pattern at the command line, be sure to enclose it in single quotes. The ^ is special to the original Bourne shell.
.DS "1"
while matching lines such as:
.DS "I"
.DS "2"
This syntax can also be used to limit the extent of a match, as we'll see up ahead.
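The two negated classes can be checked as follows. This is a sketch on hypothetical input; note the single quotes around each pattern.

```shell
# Hypothetical .DS lines; only the first has the argument "1".
sample() {
cat <<'EOF'
.DS "1"
.DS "I"
.DS "2"
EOF
}

# Lines whose quoted argument is any single character other than 1.
kept=$(sample | grep -c '\.DS "[^1]"')
# Of "12345" and "abc", only "abc" contains a non-numeric character.
nondigit=$(printf '%s\n' '12345' 'abc' | grep -c '[^0-9]')

echo "kept=$kept nondigit=$nondigit"
```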
The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions, which are the kind used by egrep and awk.
In order to accommodate non-English environments, the POSIX standard enhanced the ability of character classes to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.
POSIX also changed what had been common terminology. What we've been calling a "character class" is called a "bracket expression" in the POSIX standard. Within bracket expressions, besides literal characters such as a, !, and so on, you can have additional components. These are:
Character classes. A POSIX character class consists of keywords bracketed by [: and :]. The keywords describe different classes of characters such as alphabetic characters, control characters, and so on (see Table 3.3).
Collating symbols. A collating symbol is a multicharacter sequence that should be treated as a unit. It consists of the characters bracketed by [. and .].
Equivalence classes. An equivalence class lists a set of characters that should be considered equivalent, such as e and è. It consists of a named element from the locale, bracketed by [= and =].
All three of these constructs must appear inside the square brackets of a bracket expression. For example [[:alpha:]!] matches any single alphabetic character or the exclamation point, [[.ch.]] matches the collating element ch, but does not match just the letter c or the letter h. In a French locale, [[=e=]] might match any of e, è, or é. Classes and matching characters are shown in Table 3.3.
Class | Matching Characters |
---|---|
[:alnum:] | Alphanumeric characters |
[:alpha:] | Alphabetic characters |
[:blank:] | Space and tab characters |
[:cntrl:] | Control characters |
[:digit:] | Numeric characters |
[:graph:] | Printable and visible (non-space) characters |
[:lower:] | Lowercase characters |
[:print:] | Printable characters (includes whitespace) |
[:punct:] | Punctuation characters |
[:space:] | Whitespace characters |
[:upper:] | Uppercase characters |
[:xdigit:] | Hexadecimal digits |
These features are slowly making their way into commercial versions of sed and awk, as vendors fully implement the POSIX standard. GNU awk and GNU sed support the character class notation, but not the other two bracket notations. Check your local system documentation to see if they are available to you.
Because these features are not widely available yet, the scripts in this book will not rely on them, and we will continue to use the term "character class" to refer to lists of characters in square brackets.
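Where your grep does support the POSIX character class notation, it can be tried as in the following sketch. Both the sample words and the use of `LC_ALL=C` (to pin the locale so the classes cover ASCII only) are assumptions made for this example.

```shell
# Hypothetical sample words.
sample() {
cat <<'EOF'
Chapter
chapter
CHAPTER
Chapter9
EOF
}

# One uppercase letter followed only by lowercase letters: just "Chapter".
capitalized=$(sample | LC_ALL=C grep -c '^[[:upper:]][[:lower:]]*$')
# Any line containing a digit: just "Chapter9".
with_digit=$(sample | LC_ALL=C grep -c '[[:digit:]]')

echo "capitalized=$capitalized with_digit=$with_digit"
```

Note that the class keyword sits inside its own [: and :] delimiters, which in turn sit inside the square brackets of the bracket expression.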
The asterisk (*) metacharacter indicates that the preceding regular expression may occur zero or more times. That is, if it modifies a single character, the character may be there or not, and if it is, there may be more than one of them. You could use the asterisk metacharacter to match a word that might appear in quotes.
"*hypertext"*
The word "hypertext" will be matched regardless of whether it appears in quotes or not.
Also, if the literal character modified by the asterisk does exist, there could be more than one occurrence. For instance, let's examine a series of numbers:
1
5
10
50
100
500
1000
5000
The regular expression
[15]0*
would match all lines, whereas the regular expression
[15]00*
would match all but the first two lines. The first zero is a literal, but the second is modified by the asterisk, meaning it might or might not be present. A similar technique is used to match consecutive spaces because you usually want to match one or more, not zero or more, spaces. You can use the following expression, two spaces followed by an asterisk, to do that:
  *
When preceded by a dot metacharacter, the asterisk metacharacter matches any number of characters. It can be used to identify a span of characters between two fixed strings. If you wanted to match any string inside of quotation marks, you could specify:
".*"
This would match all characters between the first and last quotation marks on the line plus the quotation marks. The span matched by ".*" is always the longest possible. This may not seem important now but it will be once you learn about replacing the string that was matched.
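The behavior of the asterisk can be confirmed with a short sketch. The `numbers` helper is hypothetical, and the last command borrows sed's substitute command, which has not been introduced yet, purely to make the extent of the match visible.

```shell
# The series of numbers, one per line.
numbers() { printf '%s\n' 1 5 10 50 100 500 1000 5000; }

# Zero or more zeros after a 1 or 5: every line matches.
zero_or_more=$(numbers | grep -c '[15]0*')
# One literal zero plus zero or more: "1" and "5" no longer match.
one_or_more=$(numbers | grep -c '[15]00*')

# The span matched by ".*" runs from the first quotation mark to the last one.
longest=$(printf '%s\n' 'say "a" and "b" now' | sed 's/".*"/X/')

echo "zero_or_more=$zero_or_more one_or_more=$one_or_more"
echo "$longest"
```

The substitution replaces everything from the first quotation mark through the last with a single X, not just the first quoted string.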
As another example, a pair of angle brackets is a common notation for enclosing formatting instructions used in markup languages, such as SGML, HTML, and Ventura Publisher.
You could print all lines with these marks by specifying:
$ grep '<.*>' sample
When used to modify a character class, the asterisk can match any number of a character in that class. For instance, look at the following five-line sample file:
I can do it
I cannot do it
I can not do it
I can't do it
I cant do it
If we wanted to match each form of the negative statement, but not the positive statement, the following regular expression would do it:
can[ no']*t
The asterisk causes any of the characters in the class to be matched in any order and for any number of occurrences. Here it is:
$ grep "can[ no']*t" sample
I cannot do it
I can not do it
I can't do it
I cant do it
There are four hits and one miss, the positive statement. Notice that had the regular expression tried to match any number of characters between the string "can" and "t," as in the following example:
can.*t
it would have matched all lines.
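The difference between the two patterns can be counted. In this sketch the class contains a space as well as n, o, and the apostrophe, so that "can not" is included; the `sample` helper is hypothetical.

```shell
# The five-line sample file, supplied here by a helper function.
sample() {
cat <<'EOF'
I can do it
I cannot do it
I can not do it
I can't do it
I cant do it
EOF
}

# Only spaces, n, o, and ' may appear between "can" and "t".
class_hits=$(sample | grep -c "can[ no']*t")
# Anything may appear between "can" and "t", including " do i".
wild_hits=$(sample | grep -c 'can.*t')

echo "class=$class_hits wildcard=$wild_hits"
```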
The ability to match "zero or more" of something is known by the technical term "closure." The extended set of metacharacters used by egrep and awk provides several variations of closure that can be quite useful. The plus sign (+) matches one or more occurrences of the preceding regular expression. Our earlier example of matching one or more spaces can be simplified to a single space followed by a plus sign:
 +
The plus sign metacharacter can be thought of as "at least one" of the preceding character. In fact, it better corresponds to how many people think * works.
The question mark (?) matches zero or one occurrences. For instance, in a previous example, we used a regular expression to match "80286," "80386," and "80486." If we wanted to also match the string "8086," we could write a regular expression that could be used with egrep or awk:
80[234]?86
It matches the string "80" followed by a "2," a "3," a "4," or no character followed by the string "86."
Don't confuse the ? in a regular expression with the ? wildcard in the shell. The shell's ? represents a single character, equivalent to . in a regular expression.
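Both extended metacharacters can be exercised with `grep -E`, which is assumed here as the modern equivalent of egrep. The `chips` helper and the sample strings are hypothetical.

```shell
# Hypothetical list of processor names, one per line.
chips() { printf '%s\n' 8086 80286 80386 80486 80586; }

# The optional [234] lets "8086" match; "80586" still fails.
optional=$(chips | grep -E -c '80[234]?86')

# "a +b" needs at least one space between a and b, so "ab" fails.
spaces=$(printf '%s\n' 'a b' 'a   b' 'ab' | grep -E -c 'a +b')

echo "optional=$optional spaces=$spaces"
```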
As you have probably figured out, it is sometimes difficult to match a complete word. For instance, if we wanted to match the pattern "book," our search would hit lines containing the word "book" and "books" but also the words "bookish," "handbook," and "booky." The obvious thing to do to limit the matching is to surround "book" with spaces.
 book 
However, this expression would only match the word "book"; it would miss the plural "books". To match either the singular or plural word, you could use the asterisk metacharacter:
 books* 
This will match "book" or "books". However, it will not match "book" if it is followed by a period, a comma, a question mark, or a quotation mark.
When you combine the asterisk with the wildcard metacharacter (.), you can match zero or more occurrences of any character. In the previous example, we might write a fuller regular expression as:
 book.* 
This expression matches the string "book" followed by "any number of characters or none followed by a space." Here are a few lines that would match:
Here are the books that you requested
Yes, it is a good book for children
It is amazing to think that it was called a "harmful book" when
once you get to the end of the book, you can't believe
(Note that only the second line would be matched by the fixed string " book ".) The expression " book.* " matches lines containing words such as "booky," "bookworm," and "bookish." We could eliminate two of these matches by using a different modifier. The question mark (?), which is part of the extended set of metacharacters, matches 0 or 1 occurrences of the preceding character. Thus, the expression:
 book.? 
would match "book," "books," and "booky" but not "bookish" and "bookworm." To eliminate a word like "booky," we would have to use character classes to specify all the characters in that position that we want to match. Furthermore, since the question mark metacharacter is not available with sed, we would have to resort to character classes anyway, as you'll see later on.
Trying to be all-inclusive is not always practical with a regular expression, especially when using grep. Sometimes it is best to keep the expression simple and allow for the misses. However, as you use regular expressions in sed for making replacements, you will need to be more careful that your regular expression is complete. We will look at a more comprehensive regular expression for searching for words in Part II of "What's the Word?" later in this chapter.
There are two metacharacters that allow you to specify the context in which a string appears, either at the beginning of a line or at the end of a line. The circumflex (^) metacharacter is a single-character regular expression indicating the beginning of a line. The dollar sign ($) metacharacter is a single-character regular expression indicating the end of a line. These are often referred to as "anchors," since they anchor, or restrict, the match to a specific position. You could print lines that begin with a tab:
^<Tab>
(Here <Tab> stands for a literal tab character, which is normally invisible.) Without the ^ metacharacter, this expression would print any line containing a tab.
Normally, using vi to input text to be processed by troff, you do not want spaces appearing at the end of lines. If you want to find (and remove) them, this regular expression (two spaces, an asterisk, and a dollar sign) will match lines with one or more spaces at the end of a line:
  *$
troff requests and macros must be input at the beginning of a line. They are two-character strings, preceded by a dot. If a request or macro has an argument, it is usually followed by a space. The regular expression used to search for such requests is:
^\...
This expression matches "a dot at the beginning of a line followed by any two-character string, and then followed by a space."
You can use both positional metacharacters together to match blank lines:
^$
You might use this pattern to count the number of blank lines in a file using the count option, -c, to grep:
$ grep -c '^$' ch04
5
This regular expression is useful if you want to delete blank lines using sed. The following regular expression can be used to match a blank line even if it contains spaces:
^ *$
Similarly, you can match the entire line:
^.*$
which is something you might possibly want to do with sed.
In sed (and grep), "^" and "$" are only special when they occur at the beginning or end of a regular expression, respectively. Thus "^abc" means "match the letters a, b, and c only at the beginning of the line," while "ab^c" means "match a, b, a literal ^, and then c, anywhere on the line." The same is true for the "$."
In awk, it's different; "^" and "$" are always special, even though it then becomes possible to write regular expressions that don't match anything. Suffice it to say that in awk, when you want to match either a literal "^" or "$," you should always escape it with a backslash, no matter what its position in the regular expression.
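The anchored patterns from this section can be tried on a small sample. This is a sketch only: the `sample` helper and its four lines are hypothetical, and the sed command uses the delete command covered later in the book.

```shell
# Hypothetical sample: a text line, an empty line, a line of three spaces,
# and a text line with trailing spaces.
sample() { printf '%s\n' 'first line' '' '   ' 'last line   '; }

empty=$(sample | grep -c '^$')        # truly empty lines only
blank=$(sample | grep -c '^ *$')      # empty lines or lines of spaces
trailing=$(sample | grep -c '  *$')   # one or more spaces before end of line

# Delete blank lines with sed, then count the lines that remain.
kept=$(sample | sed '/^ *$/d' | grep -c .)

# In grep, a ^ that is not at the start of the pattern is a literal character.
caret=$(printf '%s\n' 'ab^c' | grep -c 'ab^c')

echo "empty=$empty blank=$blank trailing=$trailing kept=$kept caret=$caret"
```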
A pattern-matching program such as grep does not match a string if it extends over two lines. For all practical purposes, it is difficult to match phrases with assurance. Remember that text files are basically unstructured and line breaks are quite random. If you are looking for any sequence of words, it is possible that they might appear on one line but they may be split up over two.
You can write a series of regular expressions to capture a phrase:
Almond Joy
Almond$
^Joy
This is not perfect, as the second regular expression will match "Almond" at the end of a line, regardless of whether or not the next line begins with "Joy". A similar problem exists with the third regular expression.
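The three expressions can be given to grep together with repeated `-e` options. In this sketch the sample text is hypothetical; its last two lines are there to provoke a false alarm.

```shell
# Hypothetical text: the phrase on one line, the phrase split over two lines,
# and an unrelated line that happens to end with "Almond".
sample() {
  printf '%s\n' 'I ate an Almond Joy today' 'she bought an Almond' \
    'Joy bar at the store' 'a toasted Almond' 'tastes good'
}

# A line is counted if any one of the three patterns matches it.
phrase=$(sample | grep -c -e 'Almond Joy' -e 'Almond$' -e '^Joy')

echo "matched $phrase of 5 lines"
```

Four lines match, but "a toasted Almond" is not followed by "Joy", so it is a false alarm.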
Later, when we look at sed, you'll learn how to match patterns over multiple lines and you'll see a shell script incorporating sed that makes this capability available in a general way.
The metacharacters that allow you to specify repeated occurrences of a character (*+?) indicate a span of undetermined length. Consider the following expression:
11*0
It will match each of the following lines:
10
110
111110
1111111111111111111111111110
These metacharacters give elasticity to a regular expression.
Now let's look at a pair of metacharacters that allow you to indicate a span and also determine the length of the span. So, you can specify the minimum and maximum number of occurrences of a literal character or regular expression.
\{ and \} are available in grep and sed.[8] POSIX egrep and POSIX awk use { and }. In any case, the braces enclose one or two arguments.
[8] Very old versions may not have them; caveat emptor.
\{n,m\}
n and m are integers between 0 and 255. If you specify \{n\} by itself, then exactly n occurrences of the preceding character or regular expression will be matched. If you specify \{n,\}, then at least n occurrences will be matched. If you specify \{n,m\}, then any number of occurrences between n and m will be matched.[9]
[9] Note that "?" is equivalent to "\{0,1\}", "*" is equivalent to "\{0,\}", "+" is equivalent to "\{1,\}", and no modifier is equivalent to "\{1\}".
For example, the following expression will match "1001," "10001," and "100001" but not "101" or "1000001":
10\{2,4\}1
This metacharacter pair can be useful for matching data in fixed-length fields, data that perhaps was extracted from a database. It can also be used to match formatted data such as phone numbers, U.S. social security numbers, inventory part IDs, etc. For instance, the format of a social security number is three digits, a hyphen, followed by two digits, a hyphen, and then four digits. That pattern could be described as follows:
[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}
Similarly, a North American local phone number could be described with the following regular expression:
[0-9]\{3\}-[0-9]\{4\}
If you are using pre-POSIX awk, where you do not have braces available, you can simply repeat the character classes the appropriate number of times:
[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
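The interval expressions can be checked as follows. This is a sketch that assumes a grep with interval support; the identifiers are made up, and `grep -E` stands in for a POSIX egrep that uses unescaped braces.

```shell
# Strings with one to five zeros between the ones.
digits() { printf '%s\n' 101 1001 10001 100001 1000001; }
# Two to four zeros are accepted: 1001, 10001, and 100001.
interval=$(digits | grep -c '10\{2,4\}1')

# Hypothetical identifiers in three different layouts.
ids() { printf '%s\n' '123-45-6789' '555-1212' '12-345-6789'; }
ssn_bre=$(ids | grep -c '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')
ssn_ere=$(ids | grep -E -c '[0-9]{3}-[0-9]{2}-[0-9]{4}')

# The phone pattern is not anchored, so it also matches the
# "345-6789" inside "12-345-6789".
phone=$(ids | grep -c '[0-9]\{3\}-[0-9]\{4\}')

echo "interval=$interval ssn_bre=$ssn_bre ssn_ere=$ssn_ere phone=$phone"
```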
The vertical bar (|) metacharacter, part of the extended set of metacharacters, allows you to specify a union of regular expressions. A line will match the pattern if it matches one of the regular expressions. For instance, this regular expression:
UNIX|LINUX
will match lines containing either the string "UNIX" or the string "LINUX". More than one alternative can be specified:
UNIX|LINUX|NETBSD
A line matching any of these three patterns will be printed by egrep.
In sed, lacking the union metacharacter, you would specify each pattern separately. In the next section, where we look at grouping operations, we will see additional examples of this metacharacter.
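The union and its pattern-by-pattern equivalents can be compared on a small sample. This sketch assumes `grep -E` in place of egrep; the sample lines are hypothetical, and the sed form relies on the -n option and p command described later in the book.

```shell
# Hypothetical sample lines.
sample() { printf '%s\n' 'UNIX System V' 'LINUX kernel' 'NETBSD ports' 'MS-DOS'; }

# One extended regular expression with three alternatives.
union=$(sample | grep -E -c 'UNIX|LINUX|NETBSD')

# Without the union metacharacter, give each pattern separately.
separate=$(sample | grep -c -e 'UNIX' -e 'LINUX' -e 'NETBSD')

# The same idea in sed: one print command per pattern.
sed_lines=$(sample | sed -n -e '/UNIX/p' -e '/LINUX/p' -e '/NETBSD/p' | grep -c .)

echo "union=$union separate=$separate sed=$sed_lines"
```

One caveat with the sed form: a line that matched two of the patterns would be printed twice, which does not happen with this sample.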
Parentheses, (), are used to group regular expressions and establish precedence. They are part of the extended set of metacharacters. Let's say that a company's name in a text file is referred to as "BigOne" or "BigOne Computer":
BigOne( Computer)?
This expression will match the string "BigOne" by itself or followed by a single occurrence of the string "Computer". Similarly, if a term is sometimes spelled out and at other times abbreviated:
$ egrep "Lab(oratorie)?s" mail.list
Bell Laboratories, Lucent Technologies
Bell Labs
You can use parentheses with a vertical bar to group alternative operations. In the following example, we use it to specify a match of the singular or plural of the word "company."
compan(y|ies)
It is important to note that applying a quantifier to a parenthesized group of characters can't be done in most versions of sed and grep, but is available in all versions of egrep and awk.
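The grouping examples above can be tried out with a sketch like the following. The sample lines are made up, grep -E stands in for egrep, and the -o option (a common extension that prints only the matched text, not required by POSIX) is an assumption about your grep.

```shell
# Made-up sample lines built around the examples in the text.
names='Bell Laboratories, Lucent Technologies
Bell Labs
BigOne Computer announced
one company, two companies
a companion piece'

# Optional group: "Labs" or "Laboratories".
labs=$(printf '%s\n' "$names" | grep -E 'Lab(oratorie)?s')

# Alternation inside a group: "company" or "companies", not "companion".
companies=$(printf '%s\n' "$names" | grep -E 'compan(y|ies)')

# -o prints only the matched text, showing that the optional group was used.
bigone=$(printf '%s\n' "$names" | grep -E -o 'BigOne( Computer)?')

printf '%s\n' "$labs" "$companies" "$bigone"
```

The quantifier ? applies to the whole parenthesized group, which is exactly what the basic syntax of older sed and grep versions cannot express.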
Let's reevaluate the regular expression for searching for a single word in light of the new metacharacters we've discussed. Our first attempt at writing a regular expression for grep to search for a word concluded with the following expression:
 book.* 
This expression is fairly simple, matching a space followed by the string "book" followed by any number of characters followed by a space (the leading and trailing spaces are part of the pattern). However, it does not match all possible occurrences and it does match a few nuisance words.
The following test file contains numerous occurrences of "book." We've added a notation, which is not part of the file, to indicate whether the input line should be a "hit" (>) and included in the output or a "miss" (<). We've tried to include as many different examples as possible.
$ cat bookwords
> This file tests for book in various places, such as
> book at the beginning of a line or
> at the end of a line book
> as well as the plural books and
< handbooks. Here are some
< phrases that use the word in different ways:
> "book of the year award"
> to look for a line with the word "book"
> A GREAT book!
> A great book? No.
> told them about (the books) until it
> Here are the books that you requested
> Yes, it is a good book for children
> amazing that it was called a "harmful book" when
> once you get to the end of the book, you can't believe
< A well-written regular expression should
< avoid matching unrelated words,
< such as booky (is that a word?)
< and bookish and
< bookworm and so on.
As we search for occurrences of the word "book," there are 13 lines that should be matched and 7 lines that should not be matched. First, let's run the previous regular expression on the sample file and check the results.
$ grep ' book.* ' bookwords
This file tests for book in various places, such as
as well as the plural books and
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe
such as booky (is that a word?)
and bookish and
It only prints 8 of the 13 lines that we want to match and it prints 2 of the lines that we don't want to match. The expression matches lines containing the words "booky" and "bookish." It ignores "book" at the beginning of a line and at the end of a line. It ignores "book" when there are certain punctuation marks involved.
To restrict the search even more, we must use character classes. Generally, the characters that might end a word are punctuation marks, such as:
? . , ! ; : '
In addition, quotation marks, parentheses, braces, and brackets might surround a word or open or close with a word:
" () {} []
You would also have to accommodate the plural or possessive forms of the word.
Thus, you would have two different character classes: before and after the word. Remember that all we have to do is list the members of the class inside square brackets. Before the word, we now have:
["[{(]
and after the word:
[]})"?!.,;:'s]
Note that putting the closing square bracket as the first character in the class makes it a member of the class rather than closing the set. Putting the two classes together, with a space before and after as in our earlier expression, we get:
 ["[{(]*book[]})"?!.,;:'s]* 
Show this to the uninitiated, and they'll throw up their hands in despair! But now that you know the principles involved, you can not only understand this expression, but could easily reconstruct it. Let's see how it does on the sample file (we use double quotes to enclose the single quote character, and then a backslash in front of the embedded double quotes):
$ grep " [\"[{(]*book[]})\"?!.,;:'s]* " bookwords
This file tests for book in various places, such as
as well as the plural books and
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe
We eliminated the lines that we don't want, but there are five lines that we're not getting. Let's examine them:
book at the beginning of a line or
at the end of a line book
"book of the year award"
to look for a line with the word "book"
A GREAT book!
All of these are problems caused by the string appearing at the beginning or end of a line. Because there is no space at the beginning or end of a line, the pattern is not matched. We can use the positional metacharacters, ^ and $. Since we want to match either a space or beginning or end of a line, we can use egrep and specify the "or" metacharacter along with parentheses for grouping. For instance, to match either the beginning of a line or a space, you could write the expression:
(^| )
(Because | and () are part of the extended set of metacharacters, if you were using sed, you'd have to write different expressions to handle each case.)
Here's the revised regular expression:
(^| )["[{(]*book[]})"?\!.,;:'s]*( |$)
Now let's see how it works:
$ egrep "(^| )[\"[{(]*book[]})\"?\!.,;:'s]*( |$)" bookwords
This file tests for book in various places, such as
book at the beginning of a line or
at the end of a line book
as well as the plural books and
"book of the year award"
to look for a line with the word "book"
A GREAT book!
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe
This is certainly a complex regular expression; however, it can be broken down into parts. This expression may not match every single instance, but it can be easily adapted to handle other occurrences that you may find.
You could also create a simple shell script to replace "book" with a command-line argument. The only problem might be that the plural of some words is not simply "s." By sleight of hand, you could handle the "es" plural by adding "e" to the character class following the word; it would work in many cases.
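One way such a script might look is sketched below as a shell function. The name wordgrep and the variable names are hypothetical, and grep -E is used in place of egrep. The pattern is assembled from single-quoted pieces so that the ! and the quotation marks need no backslashes.

```shell
# wordgrep WORD [FILE...]: a hypothetical helper that substitutes WORD for
# "book" in the expression above. WORD is inserted into the pattern as-is,
# so it should not contain regular expression metacharacters.
wordgrep() {
    word=$1
    shift
    before='["[{(]*'               # optional opening punctuation
    after='[]})"?!.,;:'"'"'s]*'    # closing punctuation or a plural "s"
    # The word must be bounded by a space or by the start or end of the line.
    grep -E "(^| )${before}${word}${after}( |\$)" "$@"
}

# Usage: with no file argument, the function reads standard input.
printf '%s\n' 'a good book for children' 'such as booky' | wordgrep book
```

Run this way, only the first of the two sample lines is printed. Calling it as wordgrep book bookwords would repeat the search shown above.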
As a further note, the ex and vi text editors have a special metacharacter for matching a string at the beginning of a word, \<, and one for matching a string at the end of a word, \>. Used as a pair, they can match a string only when it is a complete word. (For these operators, a word is a string of non-whitespace characters with whitespace on both sides, or at the beginning or end of a line.) Matching a word is such a common case that these metacharacters would be widely used, if they were available for all regular expressions.[10]
[10] GNU programs, such as the GNU versions of awk, sed, and grep, also support \< and \>.
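If your grep does support these operators, a whole-word search becomes very short. The sketch below assumes GNU grep (the operators are an extension, not part of POSIX) and uses made-up sample lines.

```shell
# Assumes a grep that supports \< and \>, such as GNU grep.
# The sample lines are made up for illustration.
whole_word=$(printf '%s\n' 'a good book' 'the books' 'bookworm' 'handbook' |
    grep '\<book\>')
printf '%s\n' "$whole_word"
```

Only the first line is selected. Be aware that GNU grep treats a word as a run of letters, digits, and underscores, so this does not match the plural "books" the way our longer expression does.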
When using grep, it seldom matters how you match the line as long as you match it. When you want to make a replacement, however, you have to consider the extent of the match. So, what characters on the line did you actually match?
In this section, we're going to look at several examples that demonstrate the extent of a match. Then we'll use a program that works like grep but also allows you to specify a replacement string. Lastly, we will look at several metacharacters used to describe the replacement string.
Let's look at the following regular expression:
A*Z
This matches "zero or more occurrences of A followed by Z." It will produce the same result as simply specifying "Z". The letter "A" could be there or not; in fact, the letter "Z" is the only character matched. Here's a sample two-line file:
All of us, including Zippy, our dog
Some of us, including Zippy, our dog
If we try to match the previous regular expression, both lines would print out. Interestingly enough, the actual match in both cases is made on the "Z" and only the "Z". We can use the gres command (see the sidebar, "A Program for Making Single Replacements") to demonstrate the extent of the match.
$ gres "A*Z" "00" test
All of us, including 00ippy, our dog
Some of us, including 00ippy, our dog
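If you do not have gres at hand, sed's substitute command can serve as a rough stand-in, since it replaces the first match on each line (unlike gres, it also prints lines that do not match). A minimal sketch, with the two sample lines supplied on standard input:

```shell
# sed's s command replaces the first match on each line, which makes the
# extent of the match visible without the gres script.
astar=$(printf '%s\n' 'All of us, including Zippy, our dog' \
                      'Some of us, including Zippy, our dog' |
    sed 's/A*Z/00/')
printf '%s\n' "$astar"
```

In both lines only the "Z" is replaced, because the "A" at the start of the first line is not immediately followed by a "Z".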
We would have expected the extent of the match on the first line to be from the "A" to the "Z" but only the "Z" is actually matched. This result may be more apparent if we change the regular expression slightly:
A.*Z
".*" can be interpreted as "zero or more occurrences of any character," which means that "any number of characters" can be found, including none at all. The entire expression can be evaluated as "an A followed by any number of characters followed by a Z." An "A" is the initial character in the pattern and "Z" is the last character; anything or nothing might occur in between. Running grep on the same two-line file produces one line of output. We've added a line of carets (^) underneath to mark what was matched.
All of us, including Zippy, our dog
^^^^^^^^^^^^^^^^^^^^^^
The extent of the match is from "A" to "Z". The same regular expression would also match the following line:
I heard it on radio station WVAZ 1060.
                              ^^
The string "A.*Z" matches "A followed by any number of characters (including zero) followed by Z." Now, let's look at a similar set of sample lines that contain multiple occurrences of "A" and "Z".
All of us, including Zippy, our dog
All of us, including Zippy and Ziggy
All of us, including Zippy and Ziggy and Zelda
The regular expression "A.*Z" will match the longest possible extent in each case.
All of us, including Zippy, our dog
^^^^^^^^^^^^^^^^^^^^^^
All of us, including Zippy and Ziggy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All of us, including Zippy and Ziggy and Zelda
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This can cause problems if what you want is to match the shortest extent possible.
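The longest-match behavior is easy to confirm by replacing the match rather than just finding it. This sketch uses sed's substitute command on the three sample lines; the variable name is ours.

```shell
# Replace whatever "A.*Z" matches with "00". Everything from the first "A"
# through the last "Z" on each line disappears.
greedy=$(printf '%s\n' 'All of us, including Zippy, our dog' \
                       'All of us, including Zippy and Ziggy' \
                       'All of us, including Zippy and Ziggy and Zelda' |
    sed 's/A.*Z/00/')
printf '%s\n' "$greedy"
```

What remains of each line is only the text after the last "Z": "00ippy, our dog", "00iggy", and "00elda".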
Earlier we said that a regular expression tries to match the longest string possible and that can cause unexpected problems. For instance, look at the regular expression to match any number of characters inside of quotation marks:
".*"
Let's look at a troff macro that has two quoted arguments, as shown below:
.Se "Appendix" "Full Program Listings"
To match the first argument, we might describe the pattern with the following regular expression:
\.Se ".*"
However, it ends up matching the whole line because the second quotation mark in the pattern matches the last quotation mark on the line. If you know how many arguments there are, you can specify each of them:
\.Se ".*" ".*"
Although this works as you'd expect, not every line has the same number of arguments, so some lines would be missed, and in any case you simply want the first argument. Here's a different regular expression that matches the shortest possible extent between two quotation marks:
"[^"]*"
It matches "a quote followed by any number of characters that do not match a quote followed by a quote":
$ gres '"[^"]*"' '00' sampleLine
.Se 00 "Full Program Listings"
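The two patterns can be compared directly with sed's substitute command, used here as a stand-in for gres. This is a minimal sketch that applies both to the macro line shown earlier; the variable names are ours.

```shell
macro='.Se "Appendix" "Full Program Listings"'

# ".*" runs from the first quotation mark to the last one on the line.
longest=$(printf '%s\n' "$macro" | sed 's/".*"/00/')

# "[^"]*" stops at the first closing quotation mark.
shortest=$(printf '%s\n' "$macro" | sed 's/"[^"]*"/00/')

printf '%s\n' "$longest" "$shortest"
```

The first command leaves only ".Se 00", while the second replaces just the first argument and leaves "Full Program Listings" in place.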
Now let's look at a few lines with a dot character (.) used as a leader between two columns of numbers:
1........5
5........10
10.......20
100......200
The difficulty in matching the leader characters is that their number is variable. Let's say that you wanted to replace all of the leaders with a single tab. You might write a regular expression to match the line as follows:
[0-9][0-9]*\.\.*[0-9][0-9]*
This expression might unexpectedly match the line:
see Section 2.3
To restrict matching, you could specify the minimum number of dots that are common to all lines:
[0-9][0-9]*\.\{5,\}[0-9][0-9]*
This expression uses braces available in sed to match "a single number followed by at least five dots and then followed by a single number." To see this in action, we'll show a sed command that replaces the leader dots with a hyphen. However, we have not covered the syntax of sed's replacement metacharacters - \( and \) to save a part of a regular expression and \1 and \2 to recall the saved portion. This command, therefore, may look rather complex (it is!) but it does the job.
$ sed 's/\([0-9][0-9]*\)\.\{5,\}\([0-9][0-9]*\)/\1-\2/' sample
1-5
5-10
10-20
100-200
A similar expression can be written to match one or more leading tabs or tabs between columns of data. You could change the order of columns as well as replacing the tab with another delimiter. You should experiment on your own by making simple and complex replacements, using sed or gres.
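As a starting point for such experiments, here is a sketch that replaces the leader dots with a single tab and then swaps the two columns. The variable names are ours, and the tab is produced with printf because writing \t inside a sed script is not portable.

```shell
tab=$(printf '\t')   # a literal tab character
leaders='1........5
5........10
10.......20
100......200
see Section 2.3'

# Replace a run of five or more dots between two numbers with a single tab.
tabbed=$(printf '%s\n' "$leaders" |
    sed "s/\([0-9][0-9]*\)\.\{5,\}\([0-9][0-9]*\)/\1${tab}\2/")

# Same match, but recall the saved numbers in the opposite order.
swapped=$(printf '%s\n' "$leaders" |
    sed "s/\([0-9][0-9]*\)\.\{5,\}\([0-9][0-9]*\)/\2-\1/")

printf '%s\n' "$tabbed" "$swapped"
```

The line "see Section 2.3" passes through unchanged in both cases, because a single dot does not satisfy the minimum of five.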