26.7 Limiting the Extent of a Match

A regular expression tries to match the longest string possible, and that can cause unexpected problems. For instance, look at the following regular expression, which matches any number of characters inside quotation marks:

".*"

Let's look at a troff macro that has two quoted arguments, as shown below:

.Se "Appendix" "Full Program Listings"

To match the first argument, a novice might describe the pattern with the following regular expression:

\.Se ".*"

However, the pattern ends up matching the whole line because the second quotation mark in the pattern matches the last quotation mark on the line. If you know how many arguments there are, you can specify each of them:
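To see the greedy match for yourself, here is a minimal sketch. It assumes Python's re module as a stand-in for whatever tool you actually use (grep, sed, and so on); the syntax of this particular pattern is the same there, and .* is just as greedy.

```python
import re

# The troff macro line from the text.
line = '.Se "Appendix" "Full Program Listings"'

# .* first runs to the end of the line, then backs up only as far as
# it must to find a closing quote: the LAST quote on the line.
greedy = re.search(r'\.Se ".*"', line)

print(greedy.group(0))
# prints: .Se "Appendix" "Full Program Listings"
```

The match covers both arguments, not just the first one.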

\.Se ".*" ".*"

Although this works as you'd expect, not every line will have the same number of arguments, so the pattern misses lines that it should match. Besides, you simply want the first argument. Here's a different regular expression that matches the shortest possible extent between two quotation marks:

"[^"]*"

It matches "a quote, followed by any number of characters that do not match a quote, followed by a quote." The use of what we might call "negated character classes" like this is one of the things that distinguishes the journeyman regular expression user from the novice. [ Perl 5 (37.5) has added a new "non-greedy" regular expression operator that matches the shortest string possible. -JP ]
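The sketch below compares the patterns discussed above. It again assumes Python's re module rather than sed or grep, and the one-argument line is a made-up example added for illustration. Python happens to support the same non-greedy ? suffix that Perl 5 introduced, so that form is shown alongside the negated character class.

```python
import re

lines = [
    '.Se "Appendix" "Full Program Listings"',  # line from the text
    '.Se "Preface"',                           # hypothetical one-argument line
]

two_args = re.compile(r'\.Se ".*" ".*"')  # assumes exactly two arguments
negated = re.compile(r'\.Se "[^"]*"')     # negated character class
lazy = re.compile(r'\.Se ".*?"')          # Perl 5 style non-greedy operator

# The two-argument pattern misses the one-argument line entirely.
missed = two_args.search(lines[1])
print(missed)
# prints: None

# Both of the other patterns stop at the first closing quote.
first_args = [negated.search(l).group(0) for l in lines]
lazy_args = [lazy.search(l).group(0) for l in lines]
print(first_args)
# prints: ['.Se "Appendix"', '.Se "Preface"']
```

The negated character class has the advantage of working in tools that have no non-greedy operator at all.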

- DD from O'Reilly & Associates' sed & awk

