TruerWords Logo
Google
 
Web www.truerwords.net

Search TruerWords

Welcome
Sign Up  Log On

Showdown: string.patternMatch vs. re.match

A slightly different version of this message was originally posted to the frontierkernel developer's list. Now I've decided it's worth keeping around.


I added an optional startingAt parameter to string.patternMatch, in the Frontier kernel. (The lack of this parameter has bugged me since I first started using Frontier.) Without that parameter, string.patternMatch is restricted to finding only the first occurence of the pattern in the target string because it always starts looking from the first character. With the parameter, you can tell it to start anywhere in the string.

There were a few alternatives:

  • re.match, which uses regular expressions and is slightly more complicated, especially if you're searching for a charater with special meaning in regular expressions
  • manual looping through the string. Very straight forward, but insanely slow
  • chop-and-search. find the first hit with string.patternMatch, clip off the front of the string up to and including the matched character, and repeat. Ridiculously slow, mainly included for completeness. There are cases whre it's useful, but not when you're just searching through the string.

After I added the parameter, I (of course) ran some performance tests. Darn.

It's a LOT faster, there's no doubt about that. My assumption, though, was that it would also be somewhat faster than using re.match because of the extra complexity of the internals of the re (PCRE) verbs.

I was wrong.

The test script creates a 25 Kb string. 1023 'a' characters followed by a single 'b', repeated 25 times.

It extracts a list of index positions where 'b' occurs. There are three different tests: one for 'manual' looping through the string, one using string.patternMatch, and one using re.match. Each test ran 25 times so I'd have big enough numbers to compare. (The script also keeps the lists produced by each process, so I can compare them to be sure I was getting identical results).

manual loop 1710 ticks
string.patternMatch 7 ticks
re.match 8 ticks

(Um, wow. 200+ times faster for this particular test.)

A couple ticks in each test were used for updating the lists of indices. Take those out and the "kernelized" version is more like 300 times faster.

No matter how I changed the variables (number of loops, size of input string, frequency of the 'b'), string.patternMatch was always just marginally faster than re.match. If I tweaked the parameters just right I could make string.patternMatch almost twice as fast, but it was never anything greatly significant.

This wasn't just about performance, so it doesn't really matter that re.match is almost as fast. Anyone coming to Frontier from another language will probably find string.patternMatch a little frustrating, as I did, without the startingAt param, and they probably won't find re.match right away.

I'm still planning to check this in, but not right away. Some un-made decisions by the tree manager(s) (and a little consternation) are preventing me from checking in any new code this week.

Note: you can download my test script, but you won't be able to run it without the change to string.patternMatch().

Page last updated: 12/30/2004




TruerWords
is Seth Dillingham's
personal web site.
More than the sum of my parts.