Wordsmyth's Corner

Perl Primer - Chapter 6 - Regular Expressions

by Linda Naughton


What are Regular Expressions?

Regular expressions (regexps) are one of the most powerful tools in Perl.
They are like their own mini programming language.
Regular expressions define patterns for searching strings.

A few examples of things you can do with regular expressions (by no means is this a complete list):

  • Does a string contain the word "foo"?
  • Find all instances of the word "foo" and replace them with "bar".
  • Split a comma-separated string into an array containing each item.
  • Extract a piece of a string into a variable.

Regexp Definitions

A regexp definition is generally encased within forward slashes:
    /regexp/
For find/replace operations, you'll see two slash-delimited parts (one containing the "find" expression and the other containing the "replace" string).
    /regexp/replacement/
There are sometimes special key-letters ("flags") before or after the slashes to denote special actions. So a really complicated regexp may look like:
    flags/regexp/replacement/flags
Technically you can use a delimiter other than the forward slash, but this is generally not done.
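For the curious, here's a small sketch of what alternate delimiters look like. It matches and rewrites a file path, which is the one case where slashes get awkward because every slash in the pattern has to be escaped. The path and the variable names are made up for this example.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# With slash delimiters, every literal slash must be escaped.
my $with_slashes = ($path =~ /^\/usr\/local\//) ? 1 : 0;

# With m{...}, the same pattern needs no escaping.
my $with_braces = ($path =~ m{^/usr/local/}) ? 1 : 0;

# Substitutions accept other delimiters too: s{find}{replace}
(my $moved = $path) =~ s{^/usr/local}{/opt};

print "$with_slashes $with_braces $moved\n";
```

Both tests find the same thing; only the punctuation differs.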

Common Flags

Some of the common flag actions are listed below, along with whether they appear at the beginning or end of the forward slashes:
   s = Substitution (beginning)
   s = Allows dot wildcard to match newlines (end)
   i = Case-insensitive matching (end)
   g = Global substitution (end)
   x = Allow whitespace for regexp commenting (end)
These are described in more detail in their respective sections.

Simple Pattern Matching

The most common use of regexp is to identify whether a string matches a certain pattern using the "binding operator" (=~).
The syntax is:
    # Returns true if the string matches the pattern
    # specified by the regexp
    if ($string =~ /regexp/)
You can also do something if the string doesn't match the pattern.
    if ($string !~ /regexp/)
Technically you can omit the binding operator, and the regexp will operate on $_. However, this practice is generally discouraged by our coding guidelines.
    while (<>)    # <> reads each line of input into $_
    {
    # See if the line of this file matches.
    if (/regexp/)
       {
       print "Match!\n";
       }
    }
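To see both binding operators side by side, here's a short sketch that sorts a few made-up strings into those that contain foo and those that don't. The sample strings and array names are just for illustration.

```perl
use strict;
use warnings;

my @lines = ("foo fighters", "bar none", "tofoo");
my (@hits, @misses);

foreach my $line (@lines) {
    if ($line =~ /foo/) {
        push @hits, $line;      # The string contains foo
    }
    if ($line !~ /foo/) {
        push @misses, $line;    # The string does not contain foo
    }
}

print "Matched: @hits\n";
print "Did not match: @misses\n";
```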

Regular Expression Basics

You could write an entire book on how to write regular expressions (and in fact there are several already). This just covers a few of the basics.

The simplest regexp is just a string (without quotes):

    if ($string =~ /foo/)
This will be true if any part of the string contains foo - even if it's part of another word.
    The most common variable example is foo.   # Matches
    My food was tasty.    # Matches
By default, matching is case-sensitive. You can make it insensitive by throwing the "i" flag on the end.
    if ($string =~ /foo/i)    # Case-insensitive matching
You can also use variables within regexps.
    # Match against whatever's in pattern
    if ($string =~ /$pattern/)  
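One thing to watch for: whatever is in the variable gets treated as a regexp, special characters and all. The sketch below assumes a pattern containing a dot, and shows how wrapping the variable in \Q and \E makes Perl treat its contents as plain text instead. The variable names are made up for this example.

```perl
use strict;
use warnings;

my $pattern = "a.c";
my $string  = "abc";

# The dot in $pattern acts as a wildcard, so abc matches.
my $as_regexp = ($string =~ /$pattern/) ? 1 : 0;

# \Q...\E escapes special characters, so only a literal a.c would match.
my $as_literal = ($string =~ /\Q$pattern\E/) ? 1 : 0;

print "$as_regexp $as_literal\n";
```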

Wildcards

Lots of wildcards exist. These operate only on a single character. Note: This is not a complete list of wildcards, just some of the common ones.
    .  = Matches any single character (except newlines).
    \d = Matches a digit (0-9)
    \w = Matches an alphanumeric "word" character (a-z, A-Z, 0-9, underscore)
    \s = Matches a whitespace character (space, \n, \t, etc.)
Some examples:
    /foo\sbar/    # Matches foo bar or foo\nbar
    /foo\d/       # Matches foo1, foo2, foo3, etc.
    /\wfoo/       # Matches afoo, bfoo, etc.
    /foo./        # Matches foo1, foo_, etc.
Most of the letter-based wildcards have mirror images if capitalized. So \D will match a NON-digit character, \S will match a NON-whitespace character, etc.

Note that the * is not a wildcard! It has other meaning, explained below.
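Here's a quick sketch that runs one sample string through several of the wildcards above, including a capitalized "mirror image". The string foo_ was picked because the underscore counts as a word character but not as a digit. Variable names are just for illustration.

```perl
use strict;
use warnings;

my $string = "foo_";

my $digit     = ($string =~ /foo\d/) ? 1 : 0;   # 0: underscore is not a digit
my $non_digit = ($string =~ /foo\D/) ? 1 : 0;   # 1: underscore is a non-digit
my $word      = ($string =~ /foo\w/) ? 1 : 0;   # 1: underscore is a word character
my $any       = ($string =~ /foo./)  ? 1 : 0;   # 1: dot accepts any character

# A wildcard still needs a character to match, so plain foo fails.
my $nothing   = ("foo" =~ /foo./) ? 1 : 0;      # 0

print "$digit $non_digit $word $any $nothing\n";
```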


Quantifiers

The wildcards match only single characters. So what do you do if you want to match a bunch in a row?

One option is to just repeat the wildcard:

    /\w\w/    # Matches 2 word characters
Another option is to use quantifiers. The quantifiers are:
    + = Matches ONE OR MORE of the specified char
    * = Matches ZERO OR MORE of the specified char
    ? = Matches ZERO OR ONE of the specified char
The most common use of the * quantifier is to match "any old junk":
    /foo.*bar/   # Matches foobar, foo bar, foo123bar, etc.
Some other examples:
    # Matches foo bar or foo\nbar but NOT
    # foobar (since 1 or more spaces req'd).
    /foo\s+bar/

    # Matches foo bar or foo\nbar or
    # foobar (since 0 or more spaces req'd).
    /foo\s*bar/

    # Matches foo bar or foobar but NOT
    # foo   bar  (since only 1 space is permitted)
    /foo\s?bar/
Quantifiers can also be used on regular letters in addition to wildcards.
    /fo*bar/   # Matches fbar, foobar, fooobar, etc.
Be careful with the * quantifier, since it will happily match even when nothing's there.
    /f*/   # Matches anything, since the f is optional
Lastly, you can match something a specified number of times using curly braces.
   /a{5,15}/   # Matches aaaaa, aaaaaa, up to 15 a's.
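The sketch below tries each kind of quantifier against the same three made-up strings, so you can compare which ones each pattern accepts. It uses Perl's built-in grep function to collect the strings that match.

```perl
use strict;
use warnings;

my @strings = ("foobar", "foo bar", "foo   bar");

# Collect the strings that each quantifier accepts.
my @plus     = grep { $_ =~ /foo\s+bar/ }     @strings;   # one or more spaces
my @star     = grep { $_ =~ /foo\s*bar/ }     @strings;   # zero or more spaces
my @question = grep { $_ =~ /foo\s?bar/ }     @strings;   # zero or one space
my @braces   = grep { $_ =~ /foo\s{2,3}bar/ } @strings;   # two or three spaces

print "+     accepts ", scalar(@plus),     " of 3\n";
print "*     accepts ", scalar(@star),     " of 3\n";
print "?     accepts ", scalar(@question), " of 3\n";
print "{2,3} accepts ", scalar(@braces),   " of 3\n";
```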


Escaping Special Characters

If you want to match an actual period, backslash, asterisk, question mark, etc. in a regexp you need to escape it using a backslash.
    /AB\.DE/;    # Matches AB.DE
    /AB\\DE/;    # Matches AB\DE
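To show why the escape matters, this sketch compares an escaped period with an unescaped one. The test strings are made up for the example.

```perl
use strict;
use warnings;

# Unescaped, the dot is a wildcard and accepts the x.
my $unescaped = ("ABxDE" =~ /AB.DE/)  ? 1 : 0;   # 1

# Escaped, only a real period will do.
my $escaped   = ("ABxDE" =~ /AB\.DE/) ? 1 : 0;   # 0
my $period    = ("AB.DE" =~ /AB\.DE/) ? 1 : 0;   # 1

# A literal backslash is escaped with another backslash.
my $backslash = ('AB\DE' =~ /AB\\DE/) ? 1 : 0;   # 1

print "$unescaped $escaped $period $backslash\n";
```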

Classes

Use square brackets to define a "character class", which will match any single character in the specified list. The \w wildcard, for example, is just a shortcut for a character class:
   /[a-zA-Z0-9_]/   # Same as \w (for plain ASCII text)
You can match whatever characters you want in your class.
   /[a-dnw-z]/   # Matches a, b, c, d, n, w, x, y or z
Remember that these are case-sensitive unless you use the 'i' flag. You could also explicitly make it case-insensitive by including both cases:
    /[a-zA-Z]/   # Matches a letter in either case
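A short sketch of character classes at work, using a made-up sample string. Each test asks whether the string contains at least one character from the class.

```perl
use strict;
use warnings;

my $string = "Route 66";

my $has_lower  = ($string =~ /[a-z]/) ? 1 : 0;            # 1: o, u, t and e
my $has_digit  = ($string =~ /[0-9]/) ? 1 : 0;            # 1: the 6s
my $has_xyz    = ($string =~ /[x-z]/) ? 1 : 0;            # 0: no x, y or z
my $two_vowels = ($string =~ /[aeiou][aeiou]/) ? 1 : 0;   # 1: the ou in Route

# The i flag makes the class ignore case.
my $upper_only = ("ROUTE" =~ /[a-z]/)  ? 1 : 0;           # 0
my $with_flag  = ("ROUTE" =~ /[a-z]/i) ? 1 : 0;           # 1

print "$has_lower $has_digit $has_xyz $two_vowels $upper_only $with_flag\n";
```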


Groups

Use parens to create "groups" within regexps.

This can be used to apply quantifiers to entire words.

    /(fred)+/      # Matches fred, fredfred, etc.
You can also use them to match alternatives.
    /(foo|bar)/    # Matches foo OR bar
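Here's a sketch of what the parens actually change. With a group, the quantifier repeats the whole word; without one, it only repeats the last letter. The sample strings are made up for the example.

```perl
use strict;
use warnings;

# The group makes + repeat the whole "ha", not just the "a".
my $grouped   = ("haaa!"   =~ /(ha)+!/) ? 1 : 0;   # 0: needs ha, haha, hahaha...
my $ungrouped = ("haaa!"   =~ /ha+!/)   ? 1 : 0;   # 1: h followed by one or more a
my $laugh     = ("hahaha!" =~ /(ha)+!/) ? 1 : 0;   # 1

# Alternatives inside a group.
my @colors  = ("gray", "grey", "groy");
my @matched = grep { $_ =~ /gr(a|e)y/ } @colors;   # gray and grey

print "$grouped $ungrouped $laugh @matched\n";
```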


Anchors

Anchors can be used when you want to match whole words, or whole phrases. Common anchors:
    ^  - Beginning of string
    $  - End of string (with or without newline)
    \b - Word boundary (beginning or end of a word)

    # Match fred but not Frederick or Alfred 
    $string =~ /\bfred\b/;    

    # Match my phrase, as long as it's the only 
    # thing in the string.
    $string =~ /^my phrase$/; 
Note: A word boundary is the spot between a word (\w) character and a non-word character (or the beginning or end of the string). So apostrophes, quotes, and hyphens will create word boundaries.
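The sketch below runs a handful of made-up strings through each anchor so you can compare what gets through. It uses Perl's built-in grep function to collect the matches.

```perl
use strict;
use warnings;

my @names = ("fred", "alfred", "frederick", "fred's car", "wilma and fred");

# fred as a whole word: fred, fred's car, wilma and fred
my @whole = grep { $_ =~ /\bfred\b/ } @names;

# Strings that begin with fred: fred, frederick, fred's car
my @start = grep { $_ =~ /^fred/ } @names;

# Strings that end with fred: fred, alfred, wilma and fred
my @end   = grep { $_ =~ /fred$/ } @names;

# Strings that are exactly fred: fred
my @only  = grep { $_ =~ /^fred$/ } @names;

print scalar(@whole), " ", scalar(@start), " ",
      scalar(@end), " ", scalar(@only), "\n";
```

Notice that fred's car counts as a whole-word match, since the apostrophe is not a word character.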


Regexp Memory

Regexps "remember" things that match groups. These are automatically placed into the implicit variables $1 (for the first group), $2 (for the second group), etc.
    # If the string was "fred wilma", 
    #        $1 would store fred and $2 wilma
    # If the string was "barney wilma", 
    #        $1 would store barney and $2 wilma
    # If the string was "harvey jane", 
    #        there is no match and $1/$2 would be 
    #        unreliable 
    if ($string =~ /(fred|barney)\s+(wilma|betty)/)
You can assign the regexp memory to your own variables because, in list context, the regexp evaluation returns a list containing $1, $2, etc.
    if (($husband, $wife) = 
         ($string =~ /(fred|barney)\s+(wilma|betty)/))
Note: The regexp memory is persistent until the next successful match. So don't attempt to use $1, $2, etc. if the match failed. They are not necessarily undef!
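One way to stay out of trouble is to capture into your own variables inside the if test, so they are only used when the match worked. Here's a minimal sketch of that approach; the sample strings and the @captured array are made up for the example.

```perl
use strict;
use warnings;

my @captured;

foreach my $string ("fred wilma", "barney betty", "harvey jane") {
    # The list assignment is true only when the match succeeds,
    # so $husband and $wife are only used when they are valid.
    if (my ($husband, $wife) =
            ($string =~ /(fred|barney)\s+(wilma|betty)/)) {
        push @captured, "$husband+$wife";
    }
    else {
        push @captured, "no match";
    }
}

print join(", ", @captured), "\n";
```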


Substitutions (find/replace)

Another common use of regexps is to perform find/replace actions. This requires the "s" flag (for "substitution") in front of the first forward slash.
    $string =~ s/regexp/replacement/;
The simple case is to just use a string to find and a string to replace.
    $string =~ s/foo/bar/;    # Replaces "foo" with "bar"
By default, a substitution will only replace the first instance. To replace all of them, use the "g" flag.
    $string =~ s/foo/bar/g;   # Replace all "foo" w/ "bar"
But the find and replace strings can be as fancy as you want. They can also use groups. Here's a more complex example.
   # Matches y(anything)b and replaces the y with m 
   # and the b with n.
   # Anything in between is left alone because $2 
   # replicates it in the final string (written as ${2}
   # so it stands apart from the n that follows).
   # Thus, xyzabc becomes xmzanc
   $string =~ s/(y)(.*)(b)/m${2}n/;
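Here's a sketch that puts the pieces together: a single replacement, a global replacement, and a replacement that uses groups to swap two words. It also relies on the fact that a substitution returns the number of replacements it made. The strings and variable names are made up for the example.

```perl
use strict;
use warnings;

my $string = "foo foo foo";

# Copy the string, then change the copy, so the original stays intact.
(my $first = $string) =~ s/foo/bar/;     # Only the first foo changes
(my $all   = $string) =~ s/foo/bar/g;    # Every foo changes

# Groups can be rearranged in the replacement.
(my $swapped = "fred wilma") =~ s/(\w+)\s+(\w+)/$2 $1/;

# A substitution returns how many replacements were made.
my $copy  = $string;
my $count = ($copy =~ s/foo/bar/g);

print "$first | $all | $swapped | $count\n";
```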


Other Useful Flags

/s (at end) - Makes . wildcards match ANY character (including newlines)
    $string = "This is\nmy string.";
    
    # Won't match - dot wildcard doesn't include newlines.
    $string =~ /is.my/;

    # Will match - dot wildcard now includes newlines.
    $string =~ /is.my/s;
/x (at end) - Lets you use whitespace and comments to clarify regexps. Whitespace and comments are ignored in the actual matching (unless escaped).
    # First match something
    $string =~ /something

    # Then something else
    (something else)

    # Then a third thing
    [third thing]/x;
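For something you can actually run, here's a sketch of the same layout with a real pattern. It assumes a made-up string containing a number in the form 555-1234 and pulls out the two halves. Since unescaped whitespace in the pattern is ignored, a literal space would need a backslash in front of it.

```perl
use strict;
use warnings;

my $string = "Call 555-1234 today";
my ($prefix, $number);

if ($string =~ /
        (\d{3})    # First match three digits
        -          # Then a literal hyphen
        (\d{4})    # Then four more digits
    /x)
{
    ($prefix, $number) = ($1, $2);
    print "Prefix: $prefix, number: $number\n";
}
```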

Split and Join

Split and join are two related functions. Split uses a regexp to break up a string into an array, and join puts an array back into a string.
    @fields = split(/separator/, $string);
    $string = join($glue, @fields);
The split separator can be a regexp - as simple as a single delimiter, or as complex as you want. The join glue is just a string pasted between the elements.
    $string = "4:5:6";
    @fields = split(/:/, $string);
    $string = join(":", @fields);
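As an example of a fancier separator, this sketch splits a comma-separated string where the spacing around the commas is inconsistent, then glues the pieces back together with a different separator. The sample data is made up for the example.

```perl
use strict;
use warnings;

my $string = "apple, banana,cherry ,  date";

# The separator is a comma with any amount of whitespace on either side.
my @fields = split(/\s*,\s*/, $string);

# The glue is a plain string, not a regexp.
my $joined = join(" | ", @fields);

print scalar(@fields), " fields: $joined\n";
```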