Perl Regular Expressions

Emin Gabrielyan

2010-10-24

 

Perl Regular Expressions

Shell pipelines

Matching

Substitution

References

 

Shell pipelines

 

The shell command echo prints its arguments on the standard output. The “-e” option makes the command interpret escape sequences, such as “\n” for a new line.

 

$ echo "aaaa"

aaaa

 

$ echo "aaaa\nbbb\ncccc"

aaaa\nbbb\ncccc

 

$ echo -e "aaaa\nbbb\ncccc"

aaaa

bbb

cccc

 

$

 

The output of a command can be piped, using the symbol “|”, into the input of another command. The command cat prints on its standard output whatever it receives as input.

 

$ echo -e "aaaa\nbbb\ncccc" | cat

aaaa

bbb

cccc

 

$

 

The following perl command does exactly the same as the command cat: it prints the lines received on standard input to standard output, without modifying them. The option “-e” tells perl that the instructions (such as print) are given on the command line. The option “-n” tells perl to repeat those instructions for each input line.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print'

aaaa

bbb

cccc

 

$
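
As a rough sketch of what the “-n” option does, the same result can be obtained with a loop written by hand. This loop is an illustration added here, not one of the original examples; “while (<>)” reads the input line by line into “$_”.

```shell
# printf stands in for "echo -e" only for portability across shells
printf 'aaaa\nbbb\ncccc\n' | perl -e 'while (<>) { print }'
# prints the three input lines unchanged, like: perl -ne 'print'
```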

 

By default, the print command of perl prints the current input line, which is stored in the variable “$_”. Therefore “print” and “print $_” are two equivalent commands:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print $_'

aaaa

bbb

cccc

 

$

 

Matching

 

The matching operator “/…/” contains an expression that is searched for in the default variable “$_”. A match can be used as the condition of an if-statement. In the example below, the command(s) in the body of the if-statement are executed only for the lines that match the expression “aa”.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'if(/aa/){print}'

aaaa

 

$

 

Perl also lets you write the same if-statement in reversed order, with the condition after the command, when it controls a single command:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print if(/aa/)'

aaaa

 

$

 

The expression can be more complex than a plain substring. For example, “[ab]” matches a single character that is either “a” or “b”:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print if(/[ab]/)'

aaaa

bbb

 

$

 

The following reminds you that the “print” command above prints by default the variable “$_”.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print $_ if(/[ab]/)'

aaaa

bbb

 

$

 

The dot operator “.” concatenates two strings:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print "line: ".$_ if(/[ab]/)'

line: aaaa

line: bbb

 

$

 

We use the same dot operator to append a new-line symbol “\n”. The variable “$_” already contains the new-line symbol from the input line, which is why empty lines appear in the output (two consecutive new-line symbols).

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print "line: ".$_."\n" if(/[ab]/)'

line: aaaa

 

line: bbb

 

$

 

Here we introduce a new special variable of perl, “$&”. It contains the substring of the input line that matched the regular expression. In this piece of code, the new-line character “\n” is required because the matched substring does not contain a new-line character of its own.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print "matched: ".$&."\n" if(/[ab]/)'

matched: a

matched: b

 

$

 

Here we match a substring of 2 consecutive characters, each being “a” or “b”. The expression “[ab]{2}” is equivalent to “[ab][ab]”.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print "matched: ".$&."\n" if(/[ab]{2}/)'

matched: aa

matched: bb

 

$

 

Now the same as above but with 3 characters:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print "matched: ".$&."\n" if(/[ab]{3}/)'

matched: aaa

matched: bbb

 

$

 

With 4 characters, the input line “bbb” no longer matches: the pattern “[ab]{4}” needs four characters and “bbb” has only three.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 'print "matched: ".$&."\n" if(/[ab]{4}/)'

matched: aaaa

 

$

 

 

Substitution

 

The “s” perl command replaces a substring of the input line with a new substring. The printout below shows the character “a” being replaced with the character “A”. Only the first match on each line is replaced.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/a/A/; print'

Aaaa

bbb

cccc

 

$

 

Below we replace the two characters “aa” with “A”:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/a{2}/A/; print'

Aaa

bbb

cccc

 

$

 

Now we replace any two consecutive characters, each being “a” or “b”, with “K”.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/[ab]{2}/K/; print'

Kaa

Kb

cccc

 

$

 

The character “^” indicates the beginning of a line. It does not stand for a character that exists in the input line; it only refers to the position at the beginning of the line. Here we add the character “K” at the beginning of each input line.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/^/K/; print'

Kaaaa

Kbbb

Kcccc

 

$

 

Just as “^” represents the beginning of the line, the character “$” in a regular expression represents the end of the line. It matches just before the trailing new-line character, so “K” is added at the end of the visible text of each line.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/$/K/; print'

aaaaK

bbbK

ccccK

 

$

 

We used the pattern “[ab]” to indicate any character which is either “a” or “b”. To indicate any character at all (except the new line), we can use the dot “.” in the regular expression. The following piece of code replaces the last character of each line with “K”.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/.$/K/; print'

aaaK

bbK

cccK

 

$

 

The sequence of characters in square brackets is the list of characters that can match. For example, “[ab]” means “a” or “b”, and “[abc]” means “a”, “b”, or “c”. If the characters in the list are consecutive in alphabetical order, you can specify a range, “[a-c]”. Therefore “[abc]” and “[a-c]” are two equivalent notations of the same expression.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/[a-c]/K/; print'

Kaaa

Kbb

Kccc

 

$

 

Match two consecutive characters, each being “a”, “b” or “c”, and replace them with “K” in each input line:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/[a-c]{2}/K/; print'

Kaa

Kb

Kcc

 

$

 

You have already learned that “{2}” and “{3}” are quantifiers, meaning that the preceding element repeats 2 or 3 times respectively. The symbol “*” is also a quantifier; it means that the preceding element can repeat any number of times, from 0 upwards. As a result, every line matches in full and is replaced by the single symbol “K”.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/[a-c]*/K/; print'

K

K

K

 

$

 

The following has the same effect. The dot “.” matches any symbol, and the quantifier asterisk “*” allows any number of repetitions.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/.*/K/; print'

K

K

K

 

$

 

Now we match the entire string and replace it with the special variable “$&”, which holds the matched substring. The line is left unchanged, because we match a substring (here the entire string) and replace it with itself.

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/.*/$&/; print'

aaaa

bbb

cccc

 

$

 

The use of the variable becomes more interesting in the following example, where we duplicate the strings:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/.*/$& $&/; print'

aaaa aaaa

bbb bbb

cccc cccc

 

$

 

Do not hesitate to triplicate the strings if you wish so:

 

$ echo -e "aaaa\nbbb\ncccc" | perl -ne 's/.*/$& $& $&/; print'

aaaa aaaa aaaa

bbb bbb bbb

cccc cccc cccc

 

$

 

 

References

 

What is a regular expression? [This document]

 http://switzernet.com/3/public/101024-regex/

 

Advanced Perl regular expressions

http://perldoc.perl.org/perlre.html

 

 

 

 

*   *   *

Copyright © 2010 by Switzernet