Perl Primer - Chapter 6 - Regular Expressions
by Linda Naughton
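What are Regular Expressions?
Regular expressions (regexps) are one of the most powerful tools in Perl. They are like their own mini programming language. Regular expressions define patterns for searching strings. A few examples of things you can do with regular expressions, all covered in this chapter (by no means is this a complete list): check whether a string matches a pattern, pull matched pieces out of a string, find and replace text, and split a string into fields.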
Regexp Definitions
A regexp definition is generally encased within forward slashes:
/regexp/
For find/replace operations, you'll see two sets of slashes (one containing the "find" expression and the other containing the "replace" expression).
/regexp/replacement/
There are sometimes special key-letters ("flags") before or after the string to denote special actions. So a really complicated regexp may look like:
flags/regexp/replacement/flags
Technically you can use a delimiter other than the forward slash, but this is generally not done.
Common Flags
Some of the common flag actions are listed below, along with whether they appear at the beginning or end of the forward slashes:
s = Substitution (beginning)
s = Allows dot wildcard to match newlines (end)
i = Case-insensitive matching (end)
g = Global substitution (end)
x = Allow whitespace for regexp commenting (end)
These are described in more detail in their respective sections.
Simple Pattern Matching
The most common use of regexp is to identify whether a string matches a certain pattern using the "binding operator" (=~). The syntax is:
# Returns true if the string matches the pattern
# specified by the regexp
if ($string =~ /regexp/)
You can also do something if the string doesn't match the pattern.
if ($string !~ /regexp/)
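To make the two binding operators concrete, here is a minimal sketch. The sample strings and the variable names (@lines, @hits, @misses) are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, made up for illustration.
my @lines = ("error: disk full", "all good", "error: timeout");

my (@hits, @misses);
foreach my $line (@lines) {
    if ($line =~ /error/) {    # true when the pattern is found
        push @hits, $line;
    }
    if ($line !~ /error/) {    # true when the pattern is NOT found
        push @misses, $line;
    }
}

print "matched: ", scalar(@hits), ", unmatched: ", scalar(@misses), "\n";
```

Every string lands in exactly one of the two lists, since !~ is simply the negation of =~.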
Technically you can omit the binding operator, and the regexp will operate on $_. However, this practice is generally discouraged by our coding guidelines.
# Reads lines into $_ (STDIN is an assumed input source)
while (<STDIN>) { print if /regexp/; }
Regular Expression Basics
You could write an entire book on how to write regular expressions (and in fact there are several already). This just covers a few of the basics. The simplest regexp is just a string (without quotes):
if ($string =~ /foo/)
This will be true if any part of the string contains foo - even if it's part of another word.
The most common variable example is foo. # Matches
My food was tasty. # Matches
By default, matching is case-sensitive. You can make it insensitive by throwing the "i" flag on the end.
if ($string =~ /foo/i) # Case-insensitive matching
You can also use variables within regexps.
# Match against whatever's in pattern
if ($string =~ /$pattern/)
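The three basic forms above (plain string, case-insensitive, and variable pattern) can be tried side by side. This is only a sketch; the sample sentence and the contents of $pattern are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;

my $string  = "My Food was tasty.";
my $pattern = "tast";    # hypothetical pattern held in a variable

# Each test stores 1 for a match and 0 for no match.
my $plain       = ($string =~ /foo/)      ? 1 : 0;  # 0: "Food" has a capital F
my $insensitive = ($string =~ /foo/i)     ? 1 : 0;  # 1: the i flag ignores case
my $from_var    = ($string =~ /$pattern/) ? 1 : 0;  # 1: "tast" is inside "tasty"

print "$plain $insensitive $from_var\n";
```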
Wildcards
Lots of wildcards exist. These operate only on a single character. Note: This is not a complete list of wildcards, just some of the common ones.
. = Matches any single character (except newlines).
\d = Matches a digit (0-9)
\w = Matches an alphanumeric "word" character (a-z, A-Z, 0-9, underscore)
\s = Matches a whitespace character (space, \n, \t, etc.)
Some examples:
/foo\sbar/ # Matches foo bar or foo\nbar
/foo\d/ # Matches foo1, foo2, foo3, etc.
/\wfoo/ # Matches afoo, bfoo, etc.
/foo./ # Matches foo1, foo_, etc.
Most of the letter-based wildcards have mirror images if capitalized. So \D will match a NON-digit character, \S will match a NON-whitespace character, etc.
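A quick way to get a feel for the wildcards and their capitalized mirror images is to run a handful of test strings through them. The strings below are arbitrary examples, not taken from any real data.

```perl
use strict;
use warnings;

# Each entry is 1 for a match and 0 for no match.
my @results = (
    ("foo bar"  =~ /foo\sbar/) ? 1 : 0,   # a space is whitespace
    ("foo\tbar" =~ /foo\sbar/) ? 1 : 0,   # so is a tab
    ("foo7"     =~ /foo\d/)    ? 1 : 0,   # 7 is a digit
    ("foox"     =~ /foo\d/)    ? 1 : 0,   # x is not a digit
    ("foox"     =~ /foo\D/)    ? 1 : 0,   # \D is the mirror image of \d
    ("foo\n"    =~ /foo./)     ? 1 : 0,   # the dot does not match a newline
);

print "@results\n";
```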
Note that the * is not a wildcard! It has another meaning, explained below.
Quantifiers
The wildcards match only single characters. So what do you do if you want to match a bunch in a row? One option is to just repeat the wildcard:
/\w\w/ # Matches 2 word characters
Another option is to use quantifiers. The quantifiers are:
+ = Matches ONE OR MORE of the specified char
* = Matches ZERO OR MORE of the specified char
? = Matches ZERO OR ONE of the specified char
The most common use of the * quantifier is to match "any old junk":
/foo.*bar/ # Matches foo(anything)bar
Some other examples:
# Matches foo bar or foo\nbar but NOT
# foobar (since 1 or more spaces req'd).
/foo\s+bar/
# Matches foo bar or foo\nbar or
# foobar (since 0 or more spaces req'd).
/foo\s*bar/
# Matches foo bar or foobar but NOT foo and bar
# separated by two spaces (since at most 1 is permitted)
/foo\s?bar/
Quantifiers can also be used on regular letters in addition to wildcards.
/fo*bar/ # Matches fbar, foobar, fooobar, etc.
Be careful with the * quantifier, since it will happily match even when nothing's there.
/f*/ # Matches anything, since the f is optional
Lastly, you can match something a specified number of times using curly braces.
/a{5,15}/ # Matches aaaaa, aaaaaa, up to 15 a's.
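The quantifier examples above can be checked with a small script. This is a sketch under a few assumptions: matches() is a hypothetical helper written just for this demonstration, qr// is used only as a convenient way to pass a regexp to it, and the ^ and $ in the last test are anchors (covered later in this chapter) so that the count applies to the whole string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $re, else 0.
sub matches {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

# One or more whitespace characters between foo and bar
my @plus  = map { matches($_, qr/foo\s+bar/) } ("foo bar", "foo   bar", "foobar");
# Zero or more
my @star  = map { matches($_, qr/foo\s*bar/) } ("foo bar", "foobar");
# Zero or one
my @maybe = map { matches($_, qr/foo\s?bar/) } ("foo bar", "foobar", "foo  bar");
# Between two and three a's, and nothing else in the string
my @range = map { matches($_, qr/^a{2,3}$/) } ("a", "aa", "aaa", "aaaa");

print "plus: @plus\nstar: @star\nmaybe: @maybe\nrange: @range\n";
```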
Escaping Special Characters
If you want to match an actual period, backslash, asterisk, or other special character in a regexp, you need to escape it using a backslash.
/AB\.DE/; # Matches AB.DE
/AB\\DE/; # Matches AB\DE
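Escaping matters most when the text to match lives in a variable. One possible approach, sketched below, uses Perl's built-in quotemeta function or the equivalent \Q...\E markers inside the regexp. The $version variable and the test strings are assumptions for illustration.

```perl
use strict;
use warnings;

my $version = "1.5";    # hypothetical text we want to match literally

# Each test stores 1 for a match and 0 for no match.
my $unescaped = ("1x5"  =~ /1.5/)            ? 1 : 0;  # 1: the dot matches any character
my $escaped   = ("1x5"  =~ /1\.5/)           ? 1 : 0;  # 0: only a real period matches
my $quoted    = ("1x5"  =~ /\Q$version\E/)   ? 1 : 0;  # 0: \Q...\E escapes the dot
my $literal   = ("v1.5" =~ /\Q$version\E/)   ? 1 : 0;  # 1: the literal text is present

# quotemeta returns the escaped form as a string.
my $safe = quotemeta($version);

print "$unescaped $escaped $quoted $literal $safe\n";
```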
Classes
Use square brackets to define a "character class", which will match any character in the specified list. The \w wildcard, for example, is just a shortcut for a character class:
/[a-zA-Z0-9_]/ # Same as \w
You can match whatever characters you want in your class.
/[a-dnw-z]/ # Matches a, b, c, d, n, w, x, y or z
Remember that these are case-sensitive unless you use the 'i' flag. You could also explicitly make it case-insensitive by including both cases:
/[a-zA-Z]/ # Case-insensitive match
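Here is a short sketch of character classes in action. The candidate strings are arbitrary, and the ^ and $ anchors (covered later in this chapter) are used so that each test string must consist of exactly one character from the class.

```perl
use strict;
use warnings;

my @candidates = ("a", "d", "e", "n", "x", "Z");

# Keep only the strings that are a single character from the class.
my @found = grep { $_ =~ /^[a-dnw-z]$/ } @candidates;

# The same class with the i flag also accepts the capital Z.
my @found_any_case = grep { $_ =~ /^[a-dnw-z]$/i } @candidates;

# A class only contains what you list: no underscore here, unlike \w.
my $underscore_in_class = ("_" =~ /[a-z0-9]/) ? 1 : 0;
my $underscore_in_w     = ("_" =~ /\w/)       ? 1 : 0;

print "@found | @found_any_case | $underscore_in_class $underscore_in_w\n";
```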
Groups
Use parens to create "groups" within regexps. This can be used to apply quantifiers to entire words.
/(fred)+/ # Matches fred, fredfred, etc.
You can also use them to match alternatives.
/(foo|bar)/ # Matches foo OR bar
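The sketch below shows why the parens matter, using made-up test strings. It relies on the ^ and $ anchors described in the Anchors section so that the whole string has to match.

```perl
use strict;
use warnings;

# Each test stores 1 for a match and 0 for no match.

# The + applies to the whole group, so only complete repeats of fred match.
my $repeated   = ("fredfredfred" =~ /^(fred)+$/) ? 1 : 0;   # 1
my $incomplete = ("fredfre"      =~ /^(fred)+$/) ? 1 : 0;   # 0

# Without parens the alternatives are "^foo" and "bar$", so a string
# that merely starts with foo matches.
my $no_group   = ("foobaz" =~ /^foo|bar$/)   ? 1 : 0;       # 1
# With parens the whole string must be either foo or bar.
my $with_group = ("foobaz" =~ /^(foo|bar)$/) ? 1 : 0;       # 0

print "$repeated $incomplete $no_group $with_group\n";
```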
Anchors
Anchors can be used when you want to match whole words, or whole phrases. Common anchors:
^ - Beginning of string
$ - End of string (with or without newline)
\b - Word boundary (beginning or end of a word)
# Match fred but not Frederick or Alfred
$string =~ /\bfred\b/;
# Match my phrase, as long as it's the only
# thing in the string.
$string =~ /^my phrase$/;
Note: A word boundary is the position between a word (\w) character and a non-word character, or the start or end of the string. So apostrophes, quotes, and hyphens all mark word boundaries.
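The boundary rules are easier to see with a few concrete strings. The word list below is made up for illustration.

```perl
use strict;
use warnings;

my @words = ("fred", "Alfred", "fred's", "well-fred", "frederick");

# Keep the strings where fred appears as a whole word.
# The apostrophe and the hyphen both count as boundaries.
my @whole = grep { $_ =~ /\bfred\b/ } @words;

# Each test stores 1 for a match and 0 for no match.
# $ still matches when the string ends with a newline.
my $with_newline = ("my phrase\n" =~ /^my phrase$/) ? 1 : 0;   # 1
my $with_extra   = ("my phrase!"  =~ /^my phrase$/) ? 1 : 0;   # 0

print join(", ", @whole), " | $with_newline $with_extra\n";
```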
Regexp Memory
Regexps "remember" things that match groups. These are automatically placed into the implicit variables $1 (for the first group), $2 (for the second group), etc.
# If the string was "fred wilma",
# $1 would store fred and $2 wilma
# If the string was "barney wilma",
# $1 would store barney and $2 wilma
# If the string was "harvey jane",
# there is no match and $1/$2 would be
# unreliable
if ($string =~ /(fred|barney)\s+(wilma|betty)/)
You can assign the regexp memory to your own variables because, in list context, the regexp evaluation returns a list containing $1, $2, etc.
if (($husband, $wife) =
($string =~ /(fred|barney)\s+(wilma|betty)/))
Note: The regexp memory is persistent until the next successful match. So don't attempt to use $1, $2, etc. if the match failed. They are not necessarily undef!
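One way to stay clear of stale memory is to capture into your own variables and only use them when the match succeeded. The sketch below assumes a hypothetical helper named parse_couple; it also shows $1 surviving a failed match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (husband, wife) on a match, or an
# empty list when the string does not match.
sub parse_couple {
    my ($string) = @_;
    if (my ($husband, $wife) = $string =~ /(fred|barney)\s+(wilma|betty)/) {
        return ($husband, $wife);
    }
    return;
}

my @couple = parse_couple("fred   wilma");   # ("fred", "wilma")
my @nobody = parse_couple("harvey jane");    # empty list

# $1 keeps its old value when a later match fails.
"fred wilma" =~ /(fred)/;       # succeeds, $1 is "fred"
"harvey jane" =~ /(barney)/;    # fails, $1 is left alone
my $leftover = $1;              # still "fred"

print "@couple | ", scalar(@nobody), " | $leftover\n";
```

Capturing in list context means the caller never reads $1 or $2 directly, so a failed match simply yields an empty list.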
Substitutions (find/replace)
Another common use of regexps is to perform find/replace actions. This requires the "s" flag (for "substitution") in front of the first forward slash.
$string =~ s/regexp/replacement/;
The simple case is to just use a string to find and a string to replace.
$string =~ s/foo/bar/; # Replaces "foo" with "bar"
By default, a substitution will only replace the first instance. To replace all of them, use the "g" flag.
$string =~ s/foo/bar/g; # Replace all "foo" w/ "bar"
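The difference between a plain and a global substitution shows up clearly when the string contains the pattern more than once. In this sketch the sample string is made up; it also uses the fact that a substitution returns the number of replacements it made.

```perl
use strict;
use warnings;

my $first_only = "foo foo foo";
$first_only =~ s/foo/bar/;              # only the first foo is replaced

my $all = "foo foo foo";
my $count = ($all =~ s/foo/bar/g);      # every foo is replaced

print "$first_only | $all | $count replacements\n";
```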
But the find and replace strings can be as fancy as you want. They can also use groups. Here's a more complex example.
# Matches y(anything)b and replaces the y with m
# and the b with n.
# Anything in between is left alone because $2
# replicates it in the final string.
# Thus, xyzabc becomes xmzanc
$string =~ s/(y)(.*)(b)/m$2n/;
Other Useful Flags
/s (at end) - Makes . wildcards match ANY character (including newlines)
$string = "This is\nmy string.";
# Won't match - dot wildcard doesn't include newlines.
$string =~ /is.my/;
# Will match - dot wildcard now includes newlines.
$string =~ /is.my/s;
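The /s flag is most useful when a .* has to span several lines. The sketch below pulls the text between two marker lines; the BEGIN and END markers and the sample text are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "BEGIN\nline one\nline two\nEND";

# Without /s the .* cannot cross the newline between the two lines,
# so the match fails and $without stays undefined.
my ($without) = $text =~ /BEGIN\n(.*)\nEND/;

# With /s the .* can run across newlines.
my ($with) = $text =~ /BEGIN\n(.*)\nEND/s;

print defined $without ? "matched\n" : "no match without /s\n";
print "with /s:\n$with\n";
```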
/x (at end) - Lets you use whitespace and comments to clarify regexps. Whitespace and comments are ignored in the actual matching (unless escaped).
# First match something
$string =~ /something
# Then something else
(something else)
# Then a third thing
[third thing]/x;
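A concrete /x regexp might look like the following sketch. The name being parsed is made-up sample data. Note that comments inside the regexp must not contain the closing delimiter.

```perl
use strict;
use warnings;

my ($first, $last) = "fred flintstone" =~ /
    (\w+)    # first name
    \s+      # whitespace between the names
    (\w+)    # last name
/x;

# A plain space in the pattern is ignored under x,
# so this matches the string with no space in it (stores 1).
my $ignored = ("foobar" =~ /foo bar/x) ? 1 : 0;

print "$first, $last, $ignored\n";
```

Because literal whitespace is ignored under /x, the space between the names has to be matched with \s (or an escaped space).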
Split and Join
Split and join are two functions that often go together. Split uses a regexp to break up a string into an array, and join puts an array back into a string.
@fields = split(/separator/, $string);
$string = join($glue, @fields);
The split separator can be a regexp - as simple as a single delimiter, or as complex as you want. The join glue is just a string pasted between the elements.
$string = "4:5:6";
@fields = split(/:/, $string);
$string = join(":", @fields);
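For a separator that is more than a single character, the sketch below splits on either a colon or a comma with optional whitespace around it. The messy input record is an assumption made up for the example.

```perl
use strict;
use warnings;

my $record = "4 : 5:6 ,7";    # hypothetical messy input

# Separator: a colon or comma, with any surrounding whitespace.
my @fields = split(/\s*[:,]\s*/, $record);

# Glue the cleaned-up fields back together with a plain string.
my $joined = join("-", @fields);

print scalar(@fields), " fields: $joined\n";
```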