One of Perl's original applications was text processing (see section A Brief History of Perl). So far, we have seen how easy manipulation of scalar and list data is in Perl, but we have yet to explore Perl's core text processing construct: regular expressions. To remedy that, this chapter is devoted entirely to regular expressions.
Regular expressions are a concept borrowed from automata theory. Regular expressions provide a way to describe a "language" of strings.
The term "language", when used in the sense borrowed from automata theory, can be a bit confusing. A language in automata theory is simply some (possibly infinite) set of strings. Each string (which may be empty) is a sequence of characters drawn from a fixed, finite set. In our case, this set will be all the possible ASCII characters (10).
When we write a regular expression, we are writing a description of some set of possible strings. For the regular expression to have meaning, this set of possible strings that we are defining should have some meaning to us.
Regular expressions give us extreme power to do pattern matching on text documents. We can use the regular expression syntax to write a succinct description of the entire, possibly infinite class of strings that fit our specification. In addition, anyone else who understands the description language of regular expressions can easily read our description and determine what set of strings we want to match. Regular expressions are a universal notation for describing such sets of strings.
When we discuss regular expressions, we discuss "matching". If a regular expression "matches" a given string, then that string is in the class we described with the regular expression. If it does not match, then the string is not in the desired class.
We can start our discussion of regular expressions by considering the simplest operators, which can actually be used to create all possible regular expressions (11). All the other regular expression operators can be reduced to some combination of these simple operators.
In regular expressions, generally, a character matches itself. The only exceptions are regular expression special characters. To match one of these special characters, you must put a \ before the character.
For example, the regular expression abc matches the set of strings that contain abc somewhere in them. Since * happens to be a regular expression special character, the regular expression \* matches any string that contains the * character.
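As a minimal sketch of literal matching, the two patterns above can be tried in Perl. The =~ operator and the slashes around the pattern are introduced later in this chapter; the sub names and sample strings are invented here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains the literal string abc, else 0.
sub contains_abc {
    my ($text) = @_;
    return $text =~ /abc/ ? 1 : 0;
}

# Hypothetical helper: 1 if the text contains a literal * character, else 0.
# The backslash removes the special meaning of *.
sub contains_star {
    my ($text) = @_;
    return $text =~ /\*/ ? 1 : 0;
}

print contains_abc("xxabcxx"), "\n";   # abc appears in the middle
print contains_abc("a-b-c"), "\n";     # the letters are not adjacent
print contains_star("2 * 3"), "\n";    # contains a literal *
print contains_star("2 + 3"), "\n";    # no * anywhere
```

The first and third calls report a match; the second and fourth do not.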
As we mentioned, * is a regular expression special character. The * is used to indicate that zero or more of the previous character should be matched. Thus, the regular expression a* will match any string that contains zero or more a's.
Note that since a* will match any string with zero or more a's, a* will match all strings, since all strings (including the empty string) contain at least zero a's. So, a* is not a very useful regular expression.
A more useful regular expression might be baa*. This regular expression will match any string that has a b, followed by one or more a's. Thus, the set of strings we are matching is those that contain ba, baa, baaa, etc. In other words, we are looking to see if there is any "sheep speech" hidden in our text.
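The difference between a* and baa* can be sketched in Perl as follows. The sub names and the sample strings are assumptions made for this example, and the =~ operator is explained later in the chapter.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains a b followed by one or more a's.
sub has_sheep_speech {
    my ($text) = @_;
    return $text =~ /baa*/ ? 1 : 0;
}

# Hypothetical helper: a* succeeds on every string, because every string
# (even the empty one) contains zero a's somewhere.
sub matches_a_star {
    my ($text) = @_;
    return $text =~ /a*/ ? 1 : 0;
}

foreach my $text ("the sheep said baaa", "b", "xyz", "") {
    print "'$text' baa*: ", has_sheep_speech($text),
          " a*: ", matches_a_star($text), "\n";
}
```

Only the first sample contains sheep speech, while a* reports a match for all four samples, including the empty string.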
The next special character we will consider is the . character. The . will match any single character (in Perl, by default, any character except a newline). As an example, consider the regular expression a.c. This regular expression will match any string that contains an a and a c with exactly one character in between. Thus, strings that contain abc, acc, amc, etc. are all in the class of strings that this regular expression matches.
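A short Perl sketch of a.c follows, with an invented sub name and invented sample strings. It also shows two cases that do not match: no character between the a and the c, and two characters between them.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains an a, then exactly one
# character, then a c.
sub matches_a_dot_c {
    my ($text) = @_;
    return $text =~ /a.c/ ? 1 : 0;
}

foreach my $text ("abc", "amc", "a c", "ac", "abbc") {
    print "'$text': ", matches_a_dot_c($text), "\n";
}

# By default, Perl's . does not match a newline character.
print "a-newline-c: ", matches_a_dot_c("a\nc"), "\n";
```

The first three samples match. The samples ac and abbc do not, and neither does the string with a newline between the a and the c.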
The | special character is equivalent to an "or" in regular expressions. This character is used to give a choice. So, the regular expression abc|def will match any string that contains either abc or def.
Sometimes, within regular expressions, we want to group things together. Doing this allows building of larger regular expressions based on smaller components. Parentheses, (), are used for grouping.
For example, if we want to match any string that contains abc or def, zero or more times, surrounded by an xx on either side, we could write the regular expression xx(abc|def)*xx. This applies the * character to everything that is in the parentheses. Thus we can match strings such as xxabcxx, xxabcdefxx, and even xxxx (zero repetitions), etc.
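Alternation and grouping can be sketched together in Perl. The sub names and test strings below are assumptions made for illustration, not part of the original text.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains either abc or def.
sub matches_either {
    my ($text) = @_;
    return $text =~ /abc|def/ ? 1 : 0;
}

# Hypothetical helper: 1 if the text contains xx, then zero or more
# repetitions of abc or def, then xx again.
sub matches_grouped {
    my ($text) = @_;
    return $text =~ /xx(abc|def)*xx/ ? 1 : 0;
}

foreach my $text ("xxabcxx", "xxabcdefxx", "xxxx", "xxabxx", "abc") {
    print "'$text' abc|def: ", matches_either($text),
          " grouped: ", matches_grouped($text), "\n";
}
```

Note that xxxx matches the grouped pattern because the group is allowed to repeat zero times, while xxabxx does not, since ab is neither abc nor def.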
Sometimes, we want to apply the regular expression from a defined point. In other words, we want to anchor the regular expression so it is not permitted to match anywhere in the string, just from a certain point.
The anchor operators allow us to do this. When we start a regular expression with a ^, it anchors the regular expression to the beginning of the string. This means that whatever the regular expression starts with must be matched at the beginning of the string. For example, ^aa* will not match every string that contains one or more a's; rather, it matches only strings that start with one or more a's.
We can also use the $ at the end of the regular expression to anchor it at the end of the string. If we applied this to our last regular expression, we have ^aa*$ which now matches only those strings that consist of one or more a's. (In Perl, $ also matches just before a newline that ends the string.) This makes it clear that the regular expression cannot just look anywhere in the string; rather, the regular expression must be able to match the entire string exactly, or it will not match at all.
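To make the effect of the anchors concrete, here is a small Perl sketch that compares the unanchored pattern aa* with ^aa* and ^aa*$. The sub names and the sample words are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per pattern discussed above.
sub contains_a {
    my ($text) = @_;
    return $text =~ /aa*/ ? 1 : 0;      # one or more a's anywhere
}

sub starts_with_a {
    my ($text) = @_;
    return $text =~ /^aa*/ ? 1 : 0;     # one or more a's at the start
}

sub only_a {
    my ($text) = @_;
    return $text =~ /^aa*$/ ? 1 : 0;    # nothing but one or more a's
}

foreach my $text ("banana", "aardvark", "aaa", "") {
    print "'$text': ", contains_a($text), starts_with_a($text),
          only_a($text), "\n";
}
```

With these samples, banana matches only the unanchored pattern, aardvark matches the first two, aaa matches all three, and the empty string matches none.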
In most cases, you will want to anchor a regular expression to the start of the string, the end of the string, or both. Using a regular expression without some sort of anchor can produce confusing and strange results. However, it is occasionally useful.
Now that you are familiar with some of the basics of regular expressions, you probably want to know how to use them in Perl. Doing so is very easy. There is an operator, =~, that you can use to match a regular expression against scalar variables. Regular expressions in Perl are placed between two forward slashes (i.e., //). In scalar context, the whole $scalar =~ // expression will evaluate to 1 if a match occurs, and to a false value (the empty string, not undef) if it does not.
Consider the following code sample:
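    use strict;

    # Read standard input one line at a time. The loop variable is
    # declared with my, as use strict requires.
    while ( defined(my $currentLine = <STDIN>) ) {
        # Keep only lines that begin with "JMS speaks:" or "RMS speaks:".
        if ($currentLine =~ /^(J|R)MS speaks:/) {
            print $currentLine;
        }
    }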
This code will go through each line of the input, and print only those lines that start with "JMS speaks:" or "RMS speaks:".
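The same test can be tried without typing input, as in the following sketch. The sub name speaker_lines and the sample lines are assumptions made for illustration; the regular expression is the one from the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines that start with
# "JMS speaks:" or "RMS speaks:".
sub speaker_lines {
    my (@lines) = @_;
    my @kept;
    foreach my $line (@lines) {
        if ($line =~ /^(J|R)MS speaks:/) {
            push @kept, $line;
        }
    }
    return @kept;
}

# Invented sample input, standing in for lines read from STDIN.
my @sample = (
    "JMS speaks: hello\n",
    "XMS speaks: this line is skipped\n",
    "RMS speaks: greetings\n",
    "Then JMS speaks: not at the start of the line\n",
);

print speaker_lines(@sample);
```

Only the first and third sample lines are printed. The last line is skipped because the ^ anchor requires the match to begin at the start of the string.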
Writing out regular expressions can be cumbersome. For example, if we want to have a regular expression that matches any single digit, we have to write:
(0|1|2|3|4|5|6|7|8|9)
It would be terribly annoying to have to write such things out. So, Perl gives an incredible number of shortcuts for writing regular expressions. These are largely syntactic sugar, since we could write out regular expressions in the same way we did above. However, that is too cumbersome.
For example, for ranges of values, we can use brackets, []. So, for our digit expression above, we can write [0-9]. In fact, it is even easier in Perl, because \d will match that very same thing for ASCII text.
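As a sketch of why these shortcuts are only syntactic sugar, the following Perl code checks the same strings against all three ways of writing "a single digit". The sub names and the sample strings are invented here, and each pattern is anchored with ^ and $ so that the whole string must be one digit.

```perl
use strict;
use warnings;

# Hypothetical helpers: three spellings of "the string is a single digit".
sub is_digit_alternation {
    my ($text) = @_;
    return $text =~ /^(0|1|2|3|4|5|6|7|8|9)$/ ? 1 : 0;
}

sub is_digit_class {
    my ($text) = @_;
    return $text =~ /^[0-9]$/ ? 1 : 0;
}

sub is_digit_shortcut {
    my ($text) = @_;
    return $text =~ /^\d$/ ? 1 : 0;
}

foreach my $text ("0", "7", "9", "a", "42", "") {
    print "'$text': ", is_digit_alternation($text),
          is_digit_class($text), is_digit_shortcut($text), "\n";
}
```

For each of these ASCII samples the three helpers agree: the single digits match, and a, 42, and the empty string do not.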
There are lots of these kinds of shortcuts. They are listed in the `perlre' online manual and in many other places, so there is no need to list them again here.
However, as you learn about all the regular expression shortcuts, remember that they can all be reduced to the original operators we discussed above. They are simply short ways of saying things that can be built with regular characters, *, (), and |.