TruerWords Logo
Google
 
Web www.truerwords.net

Search TruerWords

Welcome
Sign Up  Log On

“regular vs. context-free grammars”

From: Brian Andresen In Response To: Top of Thread.  
Date Posted: Friday, September 22, 2006 11:20:20 AM Replies: 0
   
Enclosures: None.

Yesterday Seth and I were discussing regular expressions, and the extensions made to regexes by Perl/PCRE, Python, etc. The particular extension was named groups (search for "(?P<" in either of those documents to find the related section) and how they can be used to construct a recursive-match expression.

Most people who know what a regular expression is, probably think of it as a pattern with a particular syntax that useful for matching. From a practical standpoint, good enough. However there's a theoretical background to this — computer-science theory has a concept of a regular language, and for any regular language there's a regular expression that exactly describes (or generates) it. Conversely, any regular expression describes a regular language... until extensions are added in. Now it's possible to make a "regular expression" that no longer describes a regular language.

Why should anyone care? I think the theoretical distinctions are valuable. Most of my information on this subject comes from years poring over, and trying to understand, the "dragon" book, but I can't dredge out diagrams from those pages. Fortunately, Wikipedia's coverage actually looks to be pretty good. (I haven't done any careful review for technical correctness on all points, but then, even if I did I'm probably not smart enough to catch those kind of errors anyway.)

regular languages — See the introductory part, primarily.

regular grammars — Skip the introductory part but see the section labeled "Introduction". Also, the example (which is used in nearly all these pages) of "ai bi", "i copies of symbol 'a' followed by i copies of symbol 'b'", is exactly the problem of counting matched parentheses — substitute 'a' with '(' and 'b' with ')'.

context-free grammars — See the introductory part and maybe the "Properties of context-free languages" section.

formal grammars — See the introductory part, and the sections on "Generative grammars", The Chomsky hierarchy", and "Analytic grammars".

To me, theory is important because it provides a guide for what to implement, and confidence that the implementation will work for all relevant cases. The key thing about regular grammars, in my opinion, is that they can be recognized by a DFA which has a definite model for implementation, and analysis. The trick is figuring out how many states there are, and figuring the state transition tables.

Someone figured out that context-free grammars can be parsed using a similar pattern — a set of states, along with state transition rules — but a stack of states is needed. This is how yacc and bison work. In contrast, a DFA needs to track only one state, and the current state describes exactly where the parser is at and what it's expecting (or, able to recognize next). CFG parsers are more complex, obviously — and in many situations that complexity is very useful, i.e. exactly what's needed.

What bugs me about the regex extensions (I understand their pragmatic value, I'm bugged because of the murkiness at the theoretical level) is that the parser packages can no longer use a DFA representation. It's not clear what model they would use, and in my mind that opens doors for complications because maybe the behavior of a certain expression isn't well defined. I don't even know of any such example, but I'm bugged by the possibility. :-) So where a pure-regex package can be well-understood and faster, a regex+extensions package must have a muddied implementation, and is probably slower. Yet it's not even as capable as a context-free parser — which is a category that's been analyzed for years and is well-understood (by people other than me ;-) at a theoretical level.


Incidentally, some of the above pages contain references to a parsing expression grammar which is lauded as "a good replacement for regular expressions, because they are strictly more powerful. For example, a regular expression inherently cannot find matched pairs of parentheses, because it is not recursive, but a PEG can." I'm not sure whether regex+extensions recognizes the same languages that a PEG does... I'm also not sure that a general PEG parser algorithm can be implemented as efficiently. For sure it will be less efficient than a regular grammar parser, and (based on the description) I'm not sure that it can even be as optimized as much as a CFG parser.


I thought of another situation where understanding CFG vs. regular grammars can be useful: when designing a language. Suppose I have four simulators running on a quad-core machine, and they need to chat with each other to pass control messages and status updates. If I design a regular language, then I know I can use a parsing mechanism that's faster and uses less memory. If I don't know, then I'm probably going to write a parser that's significantly less efficient.

-brian


Discussion Thread:

There are no replies.

Trackbacks:

There are no trackbacks.


Until August 31
My Amazon sales
benefit the PMC

Homepage Links

Apr 1 - Aug 31
Ad revenue
benefits the PMC


TruerWords
is Seth Dillingham's
personal web site.
More than the sum of my parts.