Building a Codeless Language Module with BBEdit 8.5 and (Ir-)Regular Expressions

Introduction: Irregular Expressions?

Some of my work with Bare Bones and BBEdit 8.5 (see the disclaimer) involved a lot of regular expressions. Version 8.5 now supports "Codeless Language Modules" with regular expressions for identifying significant parts of a file (strings, functions, comments). I helped a few people produce these modules for their own use. For example, there was one for a home-grown language that is intended to be used with "Getting Things Done" task lists, and another was for use with newLISP files.

The challenge with the newLISP language was to write a regular expression that could identify functions, the function's name, comments, and strings. Anybody who's ever seen any lisp code knows this means (among other things) looking at a lot of parentheses.

In fact, this is a task that true regular expressions aren't up to. Regular expressions don't do matched, balanced sets of characters like open and close parentheses, where the contents of the matched set is allowed to contain arbitrarily nested pairs of the same.

In other words, true regular expressions have no way to specify that a pair of parentheses can contain any number of additional pairs of parentheses, which themselves can contain pairs of parens, etc., etc., etc. If the nesting depth is limited, then there's no problem... but if you can't know ahead of time how deep the parens will be nested then true "regular expressions" are simply not an option.
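To make the "limited depth" point concrete, here is a small sketch in Python (chosen only for illustration; the helper name nested_parens_pattern is made up for this example). It builds an ordinary, non-recursive pattern for a fixed maximum depth and shows that it stops working one level deeper.

```python
import re

def nested_parens_pattern(max_depth):
    """Build a plain (non-recursive) regex for parentheses nested up to max_depth levels."""
    pattern = r"\([^()]*\)"  # depth 1: no inner parentheses at all
    for _ in range(max_depth - 1):
        # Each extra level allows the previous pattern to appear inside the pair.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return re.compile(pattern)

two_levels = nested_parens_pattern(2)
print(bool(two_levels.fullmatch("(mul r (div 180 pi))")))  # True: nested two deep
print(bool(two_levels.fullmatch("(a (b (c)))")))           # False: nested three deep
```

Every extra level makes the pattern longer, and there is always some input that is nested one level deeper than whatever depth was chosen.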

In walks PCRE 5. My first experience with it was in BBEdit 8.5, but it's actually supported in quite a few applications.

PCRE stands for Perl Compatible Regular Expressions. That's kind of funny, when you think about it. First, the particular application we're considering here (finding arbitrarily nested, matched sets) cannot be implemented with true regular expressions. Second, the extensions that PCRE adds to the regular expression grammar to make this possible are not Perl compatible! Wacky.

Writing the newLISP Patterns

To demonstrate the flexibility of these crazy new features in PCRE 5, let's build a newLISP function-matching pattern. We'll build it up in separate pieces and then assemble it all when we're done.

Here's a sample LISP function:

(define (rad-deg r)
    ; convert radians to degrees
    (mul r (div 180 pi)))

We need patterns to describe the following:

  • a complete function
  • the name of the function
  • comments
  • strings (not used in the sample above)

Comments

Let's start with comments. According to the information I've dug up, newLISP has a single comment form:

  • Anything after a semicolon or #, until the end of the current line.
    Pattern: [;#].*$

Using extended syntax for legibility, we get the following pattern:

(?x:
    (?>  [;\#]       .*          $            )
)

The # is escaped as \# because an unescaped # starts a comment in regular expression "extended syntax".
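As a quick check of the line-comment pattern outside BBEdit, here is a minimal sketch using Python's re module. This is an adaptation, not the BBEdit setup: Python needs the MULTILINE flag before $ will match at the end of every line, and the sample text (including the trailing # note) is invented for the test.

```python
import re

# [;#].*$ as given in the text. MULTILINE lets $ match at each line end.
COMMENT = re.compile(r"[;#].*$", re.MULTILINE)

SAMPLE = """(define (rad-deg r)
    ; convert radians to degrees
    (mul r (div 180 pi))) # trailing note
"""

comments = COMMENT.findall(SAMPLE)
print(comments)
```

Run on its own like this, the pattern would also fire on a ; or # that sits inside a string, so it only makes sense alongside a pattern that recognizes strings.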

Change note: previous versions of this pattern included an inline-comment form that looked like "#| comment |#", which could span multiple lines. This was based on something I saw in vim's patterns, but I now see that the language doesn't actually support it, so I've removed it.

Strings

Strings are handled in a very similar fashion to comments, but with different delimiters.

  • Double quotes start and end strings, but the string can contain a double quote if it is immediately preceded by a backslash (so, quotes can be "escaped").
    Pattern: "(?>\\.|[^"])*?(?>"|\z)
  • An open curly brace starts a string, and a close curly brace ends it. The string can contain a close curly brace if it is escaped with a backslash.
    Pattern: \{(?s:\\.|[^}])*?(?>}|\z)
  • Strings can also start with "[text]" and end with "[/text]". These are known as tagged strings.
    Pattern: \[text\](?s:.*?)(?>\[/text\]|\z)

We'll combine those into a single pattern like we did with comments:

(?x:
    (?>
        (?>  "           (?s: \\. | [^"] )*?     (?> "         | \z )     ) |

        (?>  \{          (?s: \\. | [^}] )*?     (?> \}        | \z )     ) |

        (?>  \[text\]    (?s:.*?)                (?> \[/text\] | \z )     )
    )
)
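The three string forms can be tried out with Python's re module. Treat this as a sketch of the same idea rather than a copy of the PCRE pattern: \Z stands in for PCRE's \z, plain non-capturing groups replace the atomic (?>...) groups, and the DOTALL flag replaces the inline (?s:...). The sample line is invented.

```python
import re

STRING = re.compile(r'''
    "         (?: \\. | [^"] )*?   (?: "         | \Z )  |   # "quoted", \" allowed
    \{        (?: \\. | [^}] )*?   (?: \}        | \Z )  |   # {curly braces}
    \[text\]  .*?                  (?: \[/text\] | \Z )      # [text]tagged[/text]
''', re.VERBOSE | re.DOTALL)

SAMPLE = r'(println "say \"hi\"" {curly} [text]two' + "\n" + r'lines[/text])'

for found in STRING.findall(SAMPLE):
    print(repr(found))
```

The "| \Z" at the end of each alternative is what lets an unterminated string run to the end of the file instead of failing to match.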

Function Names

Function names are quite easy, also. Just look for the word define, then an opening parenthesis. Immediately following the opening paren should be one or more characters (of a particular set). Those characters will be the name.

Pattern: define[ ]+\(([!$%&*+-./:<=>?@^~0-9A-Z_a-z]+)

The parenthesized group at the end of the pattern captures the name itself. If you were doing an extraction or find-and-replace operation with JUST that pattern, the function name would be at \1 (or $1 in Perl).
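For example, a short Python sketch (Python's re accepts this particular pattern unchanged) pulls the name out of the sample function's first line:

```python
import re

# The function-name pattern from the text; group 1 is the name.
NAME = re.compile(r"define[ ]+\(([!$%&*+-./:<=>?@^~0-9A-Z_a-z]+)")

match = NAME.search("(define (rad-deg r)")
name = match.group(1)
print(name)  # rad-deg
```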

The Whole Function

The (pseudo-code version of the) rules for matching an entire function will look something like this:

  • Open with an open parenthesis, followed by define, followed by another opening parenthesis and the name of the function
  • Optional: a mixture of parameters (words, where each word comes from the same character set as that used for the function name) and blank spaces
  • A closing parenthesis
  • The function body, which is one or more of any or all of the following:
    • Any number (zero or more) of lisp statements, each of which is enclosed in parentheses and may contain other statements. This is where we run into the nested parentheses problem that we can't solve with true regular expressions.
    • Bare words which are not actually enclosed in additional parentheses
    • Any number of strings, inside statements or by themselves.
    • Any number of comments.
  • A final, closing parenthesis.

Matched Pairs of Parens

Before showing the whole thing, let's look at a simple example that just demonstrates matched pairs of parentheses. This requires PCRE's "named groups" feature.

(?x:
    (?P<parens>
        \(
            (?>
                (?> [^()]+ ) |
                
                (?P>parens)
            )*
        \)
    )
)

The second line creates the named group "parens". (The first line just switched on 'extended mode', to make it easier for us to read the pattern.)

Now, named groups by themselves wouldn't be a very big deal. Useful, in that they allow us to refer to capturing groups by name rather than number, but not really enabling of anything new.

What really matters is the line containing (?P>parens). It isn't a back reference to previously matched text: it calls the pattern of the named group "parens", and the call sits inside that very group. It's recursive!

So, this pattern finds an opening parenthesis with a matching close parenthesis. Between them can appear any number of non-parentheses, or more pairs of parentheses (which follow the same rules).
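Python's built-in re module has no equivalent of (?P>parens), so the recursion can't be shown there as a pattern. What can be shown is the same rule written as an ordinary recursive function. This is only an illustration of what the pattern recognizes (match_parens is a made-up name), not how PCRE implements it.

```python
def match_parens(text, pos=0):
    """Return the index just past the balanced group starting at pos, or None.

    Mirrors the pattern: "(" then any mix of non-parens and nested groups, then ")".
    """
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1                   # the matching close paren
        if ch == "(":
            pos = match_parens(text, pos)    # the recursive step, like (?P>parens)
            if pos is None:
                return None                  # inner group never closed
        else:
            pos += 1                         # any non-paren character
    return None                              # ran out of text before closing

statement = "(mul r (div 180 pi))"
print(match_parens(statement) == len(statement))  # True
print(match_parens("(a (b)"))                     # None
```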

Now let's look at the full function-detecting pattern for newLISP files.

Put It All Together

We have patterns for strings, comments, function names, and nested parentheses. Now we can put it all together and create a single pattern that will match each function in the newLISP file.

(?x:
  ^
  [ ]*
  (?P<function>
    \(
      define
      
      [ ]+
  
      \(
        (?P<function_name>
          (?P<identifier>
            [!$%&*+-./:<=>?@^~0-9A-Z_a-z]+
          )
        )
        
        [ ]*
        
        (?>
          [ ]*
          
          (?P>identifier)
        )*
      \)
  
      \s*
      
      (?P<function_body>
        (?>
          (?P<plain_text>
            (?> [^\[{";\#()]+ ) |
            
            (?> \[ (?!text\]) )
          ) |
          
          (?P<comment>
            (?> ;         .*                    $                     ) |
            
            (?> \#        .*                    $                     )
          ) |
          
          (?P<string>
            (?> "         (?s: \\. | [^"] )*?   (?> "         | \z )  ) |
            
            (?> \{        (?s: \\. | [^}] )*?   (?> \}        | \z )  ) |
            
            (?> \[text\]  (?s: .*? )            (?> \[/text\] | \z )  )
          ) |
          
          (?P<parens>
            \(
              (?>
                (?P>plain_text) |
                
                (?P>comment) |
                
                (?P>string) |
                
                (?P>parens)
              )*
              
              \s*
            \)
          ) |
          
          \s*
        )*
      )
    \)
  )
)

Yes, I know that's a long pattern. It shouldn't be too difficult to understand as long as you consider it piece by piece. Note that the function and function_name "named groups" (near the top) are required in order for function scanning to work properly.
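To see the pieces cooperate outside BBEdit, here is a rough Python sketch. Python's re cannot recurse, so this is not a port of the pattern above: regular expressions are used only for the flat pieces (comments, strings, the function header), and a depth counter takes the place of the recursive parens group. The names TOKEN, HEADER and find_functions, and the second sample function, are assumptions made for this example.

```python
import re

# One token at a time: a comment, a string, a paren, or a run of anything else.
TOKEN = re.compile(r'''
    (?P<comment> [;\#] [^\n]* )
  | (?P<string>
        "         (?: \\. | [^"] )*?   (?: "         | \Z )
      | \{        (?: \\. | [^}] )*?   (?: \}        | \Z )
      | \[text\]  .*?                  (?: \[/text\] | \Z )
    )
  | (?P<open>  \( )
  | (?P<close> \) )
  | (?P<other> [^;\#"{\[()]+ | \[ )
''', re.VERBOSE | re.DOTALL)

HEADER = re.compile(r"\(define[ ]+\(([!$%&*+-./:<=>?@^~0-9A-Z_a-z]+)")

def find_functions(source):
    """Yield (name, full_text) for each top-level (define (name ...) ...) form."""
    pos, depth, start = 0, 0, None
    while pos < len(source):
        tok = TOKEN.match(source, pos)   # every character falls into some token
        kind = tok.lastgroup
        if kind == "open":
            if depth == 0:
                start = pos              # a new top-level form begins here
            depth += 1
        elif kind == "close" and depth > 0:
            depth -= 1
            if depth == 0:               # the top-level form just closed
                header = HEADER.match(source, start)
                if header:
                    yield header.group(1), source[start:tok.end()]
        pos = tok.end()                  # parens inside comments/strings are skipped

SAMPLE = '''(define (rad-deg r)
    ; convert radians :) to degrees
    (mul r (div 180 pi)))

(define (greet) (println "hi :)"))
'''

for name, text in find_functions(SAMPLE):
    print(name, len(text))
```

The parentheses hidden in the comment and in the string are the interesting part: because comments and strings are consumed as whole tokens, they never disturb the depth count, which is the same job the comment and string groups do inside the big pattern.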

Patterns to .plists

If you open a .lsp file in BBEdit, then run a Grep search with the above search pattern, you should find that it selects the first function in the file. Command-G will take you to each successive match in the file. It works!

Creating a language module means creating a .plist file with all the information needed by BBEdit to do syntax coloring and function "scanning" for your language. You need to name the language, specify the file extension(s), list all of the language's keywords, and provide separate patterns for comments, strings, and functions.

Since the patterns for comments and strings are needed separately from the function pattern (for syntax coloring), BBEdit lets us omit the named groups for strings and comments from our function pattern, but still use the names in the pattern. So, the final pattern will be a little shorter than what you see above. (Unfortunately, when testing with BBEdit's search, you still need everything in the pattern or you'll get an error.)

When you copy your patterns into the plist file, remember:

  • If you're using the XML plist format, wrap each pattern in a CDATA section: <![CDATA[ pattern ]]>
  • If you're using the ASCII plist format, search for every backslash "\" and replace with a double backslash "\\".
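Both rules are mechanical, so they are easy to script. The following is a minimal Python sketch of just the two steps described above; the function names are made up, and anything else a plist string might need is outside its scope.

```python
def for_xml_plist(pattern):
    """Wrap a pattern in a CDATA section for the XML plist format."""
    if "]]>" in pattern:
        # A CDATA section ends at the first "]]>", so it cannot contain one.
        raise ValueError("pattern contains ']]>' and cannot go in one CDATA section")
    return "<![CDATA[" + pattern + "]]>"

def for_ascii_plist(pattern):
    """Double every backslash for the ASCII plist format."""
    return pattern.replace("\\", "\\\\")

print(for_xml_plist(r"[;#].*$"))
print(for_ascii_plist(r"define[ ]+\("))
```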

Downloads

You can download a copy of the final newLISP.plist Codeless Language Module in the XML PLIST format (updated 8/8/2007). There's also a version in ASCII Plist format (updated 8/8/2007).

If you'd like to write a Codeless Language Module for another language, I've also created a CLM template file with placeholders and basic instructions. If you prefer Apple's ASCII Plist format, get this one instead.

Updates / Changes

  • Monday, June 16, 2008
    • Fixed a bug in the guide which said that newLISP had inline comment delimiters. It doesn't, and the syntax for them has been removed from the patterns.

    • Fixed the same bug in the downloadable CLM. There is no such thing as inline comments in newLISP.

    • Strings are delimited with double quotes, curly braces, or [text][/text] tags. The guide always had this right, but the CLM only supported double quotes.

    • Some of these bugs had been fixed in my copy (on my computer) for at least a year now, but were never updated here on the site.

  • Wednesday, August 8, 2007
    • Fixed a bug in the function pattern. The example (under Matched Pairs of Parens) shows that the contents of the parentheses can be "zero or more", but the full function-matching pattern used the one-or-more modifier (+ instead of *). The result was that functions which took no parameters were skipped (no entry in the function popup, no fold).

    • Updated the downloadable language module with the above change.

    • Updated the downloadable language module with new keywords for newLISP 9.2

Page last updated: 6/23/2008




TruerWords
is Seth Dillingham's
personal web site.
Truer words were never spoken.