[ Lexington.pm ]

Mik Firestone's Quick RegEx

Overview

The Perl regex is a broad topic with a lot of power and strange options. There is a fair amount of it that I have never needed to explore, and I will not talk about those parts. Further, everything I know I have learned either by reading the perlre or perlop documents or by reading O'Reilly's Mastering Regular Expressions by Jeffrey Friedl. If anybody reading this article does not already own a copy of this book, I would strongly recommend purchasing it immediately.

This paper is split into two sections, local and universal modifiers.

Introductory example

To present a working example, consider trying to parse a configuration file where the basic layout is
    key = val
into a hash. To make it a bit more realistic, we will also allow a hash mark (#) to mark the beginning of a comment.

Yes, I could use require but that could cause an entire script to fail due to an improperly formatted file. This way, I can allow my users to edit configuration files in a way more natural to their thinking.

So to start, my first pass at something like this would look like:

    open FILE, "/some/path/to/some/file" or die "$!";
    while ( <FILE> ) {
        next if ( /^#/ );  # Skip comment lines
        s/#.*$//;          # Remove any trailing comments
        ( $key, @val ) = split /=/;     # Split the pair
        $key =~ s/^\s+//;               # Strip leading
        $key =~ s/\s+$//;               # and trailing spaces
        $val = join( '=', @val );       # Rejoin with '=' so embedded = signs survive
        $val =~ s/^\s+//;               # Strip leading but leave trailing
        $hash{$key} = $val;             # Save the value in a hash.
    }
    close( FILE );
By way of explanation, the reason I use an array in the second half of the split is in case the config file contains something like:
    key = (foo = bar) 
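
As a possible alternative to the array, split accepts a third argument that limits the number of fields it returns, so a limit of 2 leaves any later equals signs inside the value. The sketch below is only an illustration: it works on a made-up sample line held in a variable rather than on a file, and it chomps the line first so the result is easier to read.

```perl
use strict;
use warnings;

# Illustrative input; in the parser above this would be $_ read from <FILE>.
my $line = "key = (foo = bar)   # trailing comment\n";

chomp $line;                              # Drop the newline (not done above)
$line =~ s/#.*$//;                        # Remove any trailing comment
my ( $key, $val ) = split /=/, $line, 2;  # Split on the first '=' only
$key =~ s/^\s+//;                         # Strip leading
$key =~ s/\s+$//;                         # and trailing spaces
$val =~ s/^\s+//;                         # Strip leading but leave trailing

print "[$key] => [$val]\n";
```

The value keeps its embedded equals sign because split stops looking for separators once it has produced two fields.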

Local modifiers

I use the term local to indicate something that appears within the regex itself and modifies the behaviour of a small portion of the regex. To reiterate, there are a lot of local modifiers I don't use, like look-aheads and look-behinds, so I will not spend any time on them.

There are a number of commonly used local modifiers:

There are a variety of not so commonly used local modifiers that I have had occasion to use. One of my favourites is a conditional regex. The template looks like:

 (?(condition)yes-pattern)
Confusing, but consider trying to match telephone numbers where the area code may or may not be in (). If it is, you need to make sure the parentheses balance. We could use two regexes, one that looks for the parentheses and one that doesn't. Or, we could say this:
        The regex would be:
            m{ (\()?          # 0 or 1 open paren, saved as group 1
               \d\d\d         # three digits
               (?(1)\))       # if group 1 is set, look for a closing paren
             }x
Which, IMHO, is much more the Perl way.
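
To see the conditional at work, here is a small sketch that runs the pattern against a few candidate area codes. The anchors at both ends and the sample strings are my additions for the demonstration; without anchors the regex could match the digits alone and skip over a stray parenthesis.

```perl
use strict;
use warnings;

# Area code with balanced (or absent) parentheses, anchored at both ends
# so that a stray paren cannot be skipped over.
my $area_code = qr{
    ^
    (\()?        # optional open paren, saved as group 1
    \d\d\d       # three digits
    (?(1)\))     # only if group 1 matched, require a close paren
    $
}x;

for my $candidate ( '(555)', '555', '(555', '555)' ) {
    my $verdict = $candidate =~ $area_code ? 'balanced' : 'rejected';
    print "$candidate: $verdict\n";
}
```

The first two candidates are accepted and the last two are rejected, which is the behaviour that would otherwise take two separate regexes.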

Using what we have learned so far, we could write our file parser like:

            open FILE, "/path/to/some/file" or die "$!";
            while ( <FILE> ) {
                next if ( /^#/ );
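                # No trailing $ here: the value simply stops at a '#' or a newline,
                # so lines that end in a comment are still picked up.
                $hash{$1} = $2 if ( /^\s*(.+?)\s*=\s*([^#\n]+)/ );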
            }
            close( FILE );  # 9 seconds to parse same file
This will ignore incorrectly formatted lines but it will not remove whitespace from the end of the line. But notice we have reduced the number of lines it took and our Perl is beginning to look like line noise.
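
If the trailing whitespace matters, one possible tweak (my own variation, not something benchmarked in this paper) is to require the value to end on a character that is neither a hash mark nor whitespace. The sketch below tries that on a few invented lines held in an array instead of a file.

```perl
use strict;
use warnings;

# Illustrative lines standing in for the contents of a config file.
my @lines = (
    "# a full-line comment\n",
    "name = lexington   # trailing comment\n",
    "expr = (foo = bar)   \n",
    "this line has no separator\n",
);

my %hash;
for (@lines) {
    next if /^#/;
    # The value must end on a character that is neither '#' nor whitespace,
    # so trailing blanks are left outside the capture.
    $hash{$1} = $2 if /^\s*(.+?)\s*=\s*([^#\n]*[^#\s])/;
}

print "$_ => [$hash{$_}]\n" for sort keys %hash;
```

The greedy character class runs to the comment or the end of the line and then backs off until the final character of the capture is not a blank. A side effect is that a line with an empty value no longer matches at all.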

Universal modifiers

A universal modifier exists outside of the regex and affects the behaviour of the entire regex. There are fewer of these than the local modifiers and I have personally used them more frequently. The only one I will not discuss is the 'c' modifier. Read your man pages for an explanation.

The most commonly used universal modifiers are:

There are three less commonly used modifiers:

Finally, I will cover one rarely used modifier, o.

o - whenever you use variables in a regex, perl will recompile the regex every time it is used. If the variable will not change in the lifetime of the program ( i.e., the value is set after you have parsed an options file :) this can be very expensive. To avoid this, the o flag will cause the regex to be compiled only once. (Perl 5.6 and later already skip the recompile when the interpolated value is unchanged, so o buys less than it once did.)

Using this little bit more knowledge, we can rewrite our file parser in a very simple and elegant fashion. To make this a little more understandable, I am clearing the input record separator ( IRS ) so the entire file will be read into one string.
            undef $/;  # unset IRS ( an empty string would select paragraph mode instead )
            open FILE, "/some/path/to/some/file" or die "$!";
            $line = <FILE>;
            close( FILE );
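            # Neither capture may contain '#' or a newline, so comment lines are
            # skipped and a trailing comment simply ends the value.
            %hash = ( $line =~ /^\s*([^#\n]+?)\s*=\s*([^#\n]+)/mg );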
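
The two tricks in this version, reading the whole file at once and letting a /g match in list context fill the hash, can be tried without a real file. In the sketch below the config text, the in-memory filehandle and the variable names are all invented for the demonstration, and the pattern keeps '#' and newlines out of both captures, which is an assumption of this sketch.

```perl
use strict;
use warnings;

# Illustrative file contents; an in-memory filehandle stands in for a real file.
my $config = "alpha = 1\nbeta = two words\n\ngamma = 3\n";

open my $fh, '<', \$config or die "$!";
my $text = do { local $/; <$fh> };   # $/ is undefined inside the block: read it all
close $fh;

# In list context, /g returns every captured group from every match,
# which is a flat key, value, key, value list ready to become a hash.
my %hash = ( $text =~ /^\s*([^#\n]+?)\s*=\s*([^#\n]+)/mg );

print "$_ => [$hash{$_}]\n" for sort keys %hash;
```

Using local inside a do block restores the record separator as soon as the read is finished, so the rest of the program still reads line by line. The blank line in the sample is there on purpose: in paragraph mode the read would have stopped at it.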

Notes

Since there seems to be a great deal of emphasis in the perl community on benchmarks as a measure of correctness ( if it runs fast, it must be good ) I decided to run my three versions of the parser through the Benchmark module. To get reliable numbers, I ran 10,000 iterations of each method over a 13 line file. The results, in order of appearance, were:
  1. The first try clocked in at 10 seconds.
  2. The second try clocked in at about 10 seconds as well.
  3. The third try clocked in at 5 seconds.
So, the final example is not only elegant perl but it is fast perl as well.
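
For anyone who wants to repeat the comparison, here is a rough sketch of how two of the approaches might be fed to the Benchmark module. It is not the script behind the numbers above: the 13 generated lines, the sub names and the iteration count are all stand-ins, the input is held in memory so no file is read, and the timings will differ from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Illustrative input held in memory so the timings exclude file I/O.
my $text = join '', map { "key$_ = value $_   # comment\n" } 1 .. 13;

# Line-at-a-time parser in the style of the first example.
sub parse_by_split {
    my %hash;
    for my $line ( split /^/, $text ) {
        next if $line =~ /^#/;
        $line =~ s/#.*$//;
        my ( $key, $val ) = split /=/, $line, 2;
        next unless defined $val;
        s/^\s+|\s+$//g for $key, $val;    # trim both ends of both fields
        $hash{$key} = $val;
    }
    return \%hash;
}

# Whole-string parser in the style of the final example.
sub parse_by_regex {
    my %hash = ( $text =~ /^\s*([^#\n]+?)\s*=\s*([^#\n]*[^#\s])/mg );
    return \%hash;
}

# The iteration count is arbitrary; raise it for steadier numbers.
timethese( 2_000, {
    split_loop   => \&parse_by_split,
    single_regex => \&parse_by_regex,
} );
```

Both subs are written to produce the same hash, which is worth checking before trusting any timing: a parser that is fast because it skips lines is not a fair comparison.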

References

If your sysadmin has installed them, the online man pages for perl are an excellent if overwhelming resource. Everything I have discussed can be found in either perlre or perlop.

If you haven't purchased it already, I cannot recommend O'Reilly's Mastering Regular Expressions, by Jeffrey Friedl, strongly enough. It is an excellent overview of Regular Expressions in general and the Perl regex engine in particular.
