Tutorial - Making a Regex-powered parser

Making regex parsers (mIRC)

Rico Sta. Cruz (enfusion), March 2005

(This is an advanced article. I assume you know regex and have done lots of coding.)

Have you ever tried making a parser yourself? Studying the string-parsing routines, then typing up long spaghetti code for hours, and finally coming up with a hideous soup of source code which you're sure to not understand the next day? Then the time comes when your project needs to expand and your parser needs to handle new tokens, and you end up throwing your mumbo-jumbo code away, only to rewrite something twice as complex and twice as incomprehensible?

Insanely tedious and impractical, right?

Fear not! Whenever developers face tasks that are rigorous and mundane (like, for instance, making a parser), chances are other developers have already made solutions to ease your woes. This is, after all, how the computer programming world progresses -- by abstracting and simplifying things.

There are parser generators out there that do the job of writing the parsing code for you. Two of the most notable (in my opinion) are bison and yacc, both of which generate C code to parse files based on a grammar rule-set you write. I suggest you go look into these two wonderful projects, even if you don't plan on using them -- just to see how they work.

But alas, there is no such tool for mIRC! That's alright, we'll just have to make do with what we have, which actually works quite similarly to the parser tools I mentioned.

Enter regular expressions.

Advantages of regex-powered parsers over normal parsers

Interestingly enough, these are exactly the things every parser's code needs to be:

Faster, usually. (See next section)
Much much more manageable.
Scalable and extensible.

Speed of Regex

At first, I doubted the speed of a regex-based parser. Regex rule strings need to be parsed, after which they get juggled around by hundreds of lines of internal code. This sounded like a massive overhead.

I gave this some thought and came up with the following hypotheses:

mIRC's scripting engine isn't quite the fastest out there. If you plan on writing your own parser that navigates through a file with /fOpen + $fRead and then crunches each line with $pos, $gettok, $mid, etc., mIRC's scripting engine will be wading through your entire script and evaluating each and every line of code you wrote. mIRC's regex parser is built from optimized C (?) code, which should clearly outperform a sluggish scripting engine trying to parse custom lines of code one by one.
Regex strings are cached by mIRC. Most (all?) regex engines "compile" regex strings before using them -- caching the compiled regex strings will increase performance.

Complexity of regex

Here's an example of a regex string for a parser that I wrote (the one that prompted me to write this article and share my experience with it :):

^\s*(?:(\w+)\s*:\s*)?(\w+(?:/\w+)*)(?:\s*\.\s*(\w+))?\s*=\s*(?:(?!\s*{$)(.+))$

"Holy symbols, Batman! Just how can anyone possibly understand, manage, and even extend a line that looks like a hacker's leetspeak gone haywire?" you ask.

That's what this article is about. We're going to break this down into manageable, human-readable chunks.
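Before breaking it down, here's a rough look at what that monster line actually does. This is a minimal sketch in Python, using its re module as a stand-in for $regex on the assumption that both engines treat this particular pattern the same way. The helper name parse_line and the three test lines are made up for illustration; only the pattern comes from the article.

```python
import re

# The article's regex, copied verbatim. Python's re module is only a
# stand-in for mIRC's $regex here (an assumption, not the author's setup).
PATTERN = re.compile(
    r"^\s*(?:(\w+)\s*:\s*)?(\w+(?:/\w+)*)(?:\s*\.\s*(\w+))?\s*=\s*(?:(?!\s*{$)(.+))$"
)

def parse_line(line):
    """Return (type, key, attribute, value), or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

# Hypothetical test lines, not taken from the article.
print(parse_line("button: dialog/ok.width = 40"))  # ('button', 'dialog/ok', 'width', '40')
print(parse_line("logging = on"))                  # (None, 'logging', None, 'on')
print(parse_line("font = {"))                      # None: the lookahead rejects a lone "{" value
```

Read this way, the four parenthesized parts are an optional type, a key (possibly a slash-separated path), an optional attribute, and the value.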

Case study

The mIRC script I'm working on has files with a data structure that's akin to XML (tree structure). Let's make a parser for it. Here's a sample of the format we're gonna be working with:

away/ {

  ; A "key = value" pair
  logging = on
  defaultMessage = I'm flying away...

  ; A folder (designated by the / postfix)
  font/ {
    name = Franklin Gothic Medium
    size = 14
  }
}

; (Note that the first half of this article doesn't cover this next part.
; We'll tackle it in the 'Expanding' section.)

dialog/ {

  ; With an attribute -- key.attribute = value
  ok.width = 40
  ok.height = 100
  ok.text = Okay! Alright!
  
  ; With a type -- type : key = value  
  editbox : myEdit = foof;
  
  ; A folder with a type  
  button: cancel/ {
    .width = 40
    .height = 90
    .text = Uh oh. Cancel.
  }
}

For now, let's concentrate on the first part. We're gonna write a parser that treats this line-by-line. Traditionally (i.e., the method that I vehemently abhor), you might code one like this:

  %t = $fRead(...file,..)
  if ((*/ isWm %t) && ($chr(125) == $gettok(%t,-1,32))) {
    ..folder that has stuff (e.g., "font/")
    some more long code go here
    plus a bunch of nested if statements
  }
  elseif ((= isin %t) || ($left() %t) && (etc etc)) {
    .. a key = value pair (e.g., "logging = on")
    more tangled code go here
  }

Ugly. Let's do it this way.

The regex way: a prelude

We have 2 types of lines, a "folder" line and a "key=value" line. We know that they are comprised of:

  folderLine := folderName whiteSpaces '/' whiteSpaces '{'
  keyLine := keyName whiteSpaces '=' whiteSpaces value

Now let's define the rest of the tokens:

  whiteSpaces := '\s*'
  (That's regex for zero-or-more whitespace characters, such as spaces and tabs)

  folderName := name
  keyName := name
  value := '.+'
  (Regex for 'match any string of one or more characters')

  name := '\w+'
  (One or more alphanumeric characters or underscores)

Let's make a compiler for these kinds of lines. Note that this compiler is only going to be used by you, and is merely temporary -- its only purpose is to generate the regex strings we'll be using for our final parser.

Now let's turn those lines into mIRC lines:

alias regexFoo {
  var %ws = \s*
  var %name = \w+
   
  var %folderName = %name
  var %keyName = %name
  var %value = .+
    
  var %folderLine = %folderName %ws / %ws $chr(123)
  var %keyLine = %keyName %ws = %ws %value
  
  ; Add ^ and $ to signify that we're matching one whole line
  %folderLine = ^ $+ $remove(%folderLine,$chr(32)) $+ $
  %keyLine = ^ $+ $remove(%keyLine,$chr(32)) $+ $
  
  ; Show the result
  echo -s * folderLine := %folderLine
  echo -s * keyLine := %keyLine
}

Note that I'm using $remove(...,$chr(32)) to remove the spaces. That's because $+ is ugly and I don't want to use it, and we won't be matching literal spaces anyway (our whitespaces are \s). Running the alias (our little "compiler"), we get:

* folderLine := ^\w+\s*/\s*{$
* keyLine := ^\w+\s*=\s*.+$

Testing the keyLine regex against various strings with $regex, we get:

Test string	Result
cowSound = moo	OK
cowSound=moo	OK
cowSound = moo	OK
oink	Fail
= oink	Fail
pigSound =	Fail

Perfect!
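If you'd like to double-check a generated string outside mIRC, something like the following works as a quick sanity test. It's only a sketch: Python's re module stands in for $regex (assuming both engines agree on a pattern this simple), and the names KEY_LINE, CASES and check are invented for the example. The test strings and expected results are the ones from the table.

```python
import re

# keyLine exactly as printed by the regexFoo alias.
KEY_LINE = re.compile(r"^\w+\s*=\s*.+$")

# (test string, should it match?) pairs taken from the table above.
CASES = [
    ("cowSound = moo", True),
    ("cowSound=moo", True),
    ("oink", False),
    ("= oink", False),
    ("pigSound =", False),
]

def check(line):
    """True when the whole line matches keyLine."""
    return KEY_LINE.match(line) is not None

for line, expected in CASES:
    result = "OK" if check(line) else "Fail"
    # Flag any row where the regex disagrees with the table.
    note = "" if check(line) == expected else "  <-- unexpected"
    print(f"{line!r:20} {result}{note}")
```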

Expanding

Time to expand. We've got a parser that tackles key=value pairs. But what about keys with attributes, i.e.: key.attribute = value?

Let's modify our code, mainly keyLine:

; Follows the same pattern as keyName and folderName.
var %attribName = %name
...
; Wrap the attribute string in "(" and ")?"
; "?" means "match zero or one of these"
; Also note that the . is escaped with a backslash since the dot is a
; special regex character.
var %keyLine = %keyName ( %ws \. %ws %attribName )? %ws = %ws %value

We include ( and )? to signify that it's optional. But the OC in me cringes at the sight of regex code that seems to be camouflaging itself in our string -- so, let's improve on this and make it semantically nicer. We are, after all, going for manageable and readable code:

; Look! I'm using "(?: ... )?" instead of just "()?".
; Notice the extra "?:". What does this mean? Nothing.
; ...Really. More on this later. Ignore it for now.
alias -l OPTIONAL return (?: $+ $1- $+ )?
...
...
var %keyLine = %keyName  $OPTIONAL( %ws \. %ws %attribName )  %ws = %ws %value
...

Perfect. We just added attributes! Now for the optional type.
(i.e.,[type:]key[.attribute] = value)

...
; Follows the same pattern as keyName and attribName.
var %typeName = %name
var %typeNameOptional = $OPTIONAL( %typeName %ws : %ws )
...
var %keyLine = %typeNameOptional %keyName  $OPTIONAL( %ws \. %ws %attribName )  %ws = %ws %value
...

Here's our assembled code so far: (the new pieces are $OPTIONAL, %attribName, %typeName, %typeNameOptional, and the longer %keyLine)

alias -l OPTIONAL   return (?: $+ $1- $+ )?
alias regexFoo {
  var %ws = \s*
  var %name = \w+
   
  var %folderName = %name
  var %keyName = %name
  var %attribName = %name
  var %typeName = %name
  var %value = .+

  var %typeNameOptional = $OPTIONAL( %typeName %ws : %ws )
    
  var %folderLine = %folderName %ws / %ws $chr(123)
  var %keyLine = %typeNameOptional %keyName  $OPTIONAL( %ws \. %ws %attribName )  %ws = %ws %value

  %folderLine = ^ $+ $remove(%folderLine,$chr(32)) $+ $
  %keyLine = ^ $+ $remove(%keyLine,$chr(32)) $+ $
  
  echo -s * folderLine := %folderLine
  echo -s * keyLine := %keyLine
}

Time to "compile".

* folderLine := ^\w+\s*/\s*{$
* keyLine := ^(?:\w+\s*:\s*)?\w+(?:\s*\.\s*\w+)?\s*=\s*.+$

Time to test keyLine. (Let's not worry about folderLine for now.)

Test string	Result
animal: cow = moo	OK
animal:cow=moo	OK
cow.sound = moo	OK
cow . sound =moo	OK
animal:cow.sound = moo	OK
animal: pig =	Fail
animal: = oink	Fail
animal: .sound = baa	Fail

Perfect! Just as expected.
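The same compose-from-fragments idea can be sketched in Python, in case you want to rebuild and re-test the string somewhere other than mIRC. Treat it as an illustration under assumptions: Python's re stands in for $regex, and the optional() helper is a hypothetical mirror of the $OPTIONAL alias. Since Python joins strings directly, there are no spaces to strip afterwards.

```python
import re

def optional(*parts):
    """Hypothetical mirror of the $OPTIONAL alias: wrap the parts in (?: ... )?"""
    return "(?:" + "".join(parts) + ")?"

# Token fragments, named after the article's variables.
WS = r"\s*"
NAME = r"\w+"
VALUE = r".+"

type_optional = optional(NAME, WS, ":", WS)
attrib_optional = optional(WS, r"\.", WS, NAME)

# [type:]key[.attribute] = value, anchored to the whole line.
key_line = "^" + type_optional + NAME + attrib_optional + WS + "=" + WS + VALUE + "$"
print(key_line)

# (test string, should it match?) pairs taken from the table above.
CASES = [
    ("animal: cow = moo", True),
    ("animal:cow=moo", True),
    ("cow.sound = moo", True),
    ("cow . sound =moo", True),
    ("animal:cow.sound = moo", True),
    ("animal: pig =", False),
    ("animal: = oink", False),
    ("animal: .sound = baa", False),
]

for line, expected in CASES:
    matched = re.match(key_line, line) is not None
    print(f"{line!r:28} {'OK' if matched else 'Fail'}")
```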

$regml

Good, we got it to parse 'em strings. Now how are we going to fish out the parts we need? (i.e., the type, key, attribute, and value)

mIRC has $regml to find out what we matched. It does this by simply returning the exact string that matched a part of the expression enclosed in parentheses.

Tip!

Parentheses is pronounced as paREN-thee-seez. Same with crises, nemeses, and the rest of the is->es plurals.

Back to mIRC scripting: let's modify our %folderName, %typeName, %keyName, %attribName, and %value variables:

alias -l SIGNIF   return ( $+ $1- $+ )
...
  var %folderName = $SIGNIF( %name )
  var %keyName    = $SIGNIF( %name )
  var %attribName = $SIGNIF( %name )
  var %typeName   = $SIGNIF( %name )
  var %value      = $SIGNIF( .+ )
...

...With the little $SIGNIF alias there to aid in semantics, hehe.

Now, why did we have the extra ?: in $OPTIONAL? It makes the group non-capturing, so $regml will skip that group. (This doesn't affect what $regex matches.)
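To see the difference the ?: makes, here is a small side-by-side sketch. It uses Python's re module rather than $regex and $regml (an assumption made purely so the example can be run anywhere), and both patterns are cut-down versions of keyLine written for this illustration.

```python
import re

line = "animal: cow = moo"

# Plain parentheses around the optional type: the wrapper itself is captured,
# so it shows up as an extra group in front of the parts we actually want.
capturing = re.match(r"^((\w+)\s*:\s*)?(\w+)\s*=\s*(.+)$", line)
print(capturing.groups())      # ('animal: ', 'animal', 'cow', 'moo')

# (?: ... ) groups without capturing, so only the marked parts come back.
non_capturing = re.match(r"^(?:(\w+)\s*:\s*)?(\w+)\s*=\s*(.+)$", line)
print(non_capturing.groups())  # ('animal', 'cow', 'moo')
```

Both patterns match exactly the same lines; only the list of captured pieces changes.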

Our final (not really) code and testing

Here we go:

alias -l OPTIONAL   return (?: $+ $1- $+ )?
alias -l SIGNIF   return ( $+ $1- $+ )

alias regexFoo {
  var %ws = \s*
  var %name = \w+
   
  var %folderName = $SIGNIF( %name )
  var %keyName    = $SIGNIF( %name )
  var %attribName = $SIGNIF( %name )
  var %typeName   = $SIGNIF( %name )
  var %value      = $SIGNIF( .+ )

  var %typeNameOptional = $OPTIONAL( %typeName %ws : %ws )
    
  var %folderLine = %folderName %ws / %ws $chr(123)
  var %keyLine = %typeNameOptional %keyName  $OPTIONAL( %ws \. %ws %attribName )  %ws = %ws %value

  %folderLine = ^ $+ $remove(%folderLine,$chr(32)) $+ $
  %keyLine = ^ $+ $remove(%keyLine,$chr(32)) $+ $
  
  echo -s * folderLine := %folderLine
  echo -s * keyLine := %keyLine
}

Output:

* folderLine := ^(\w+)\s*/\s*{$
* keyLine := ^(?:(\w+)\s*:\s*)?(\w+)(?:\s*\.\s*(\w+))?\s*=\s*(.+)$

Further testing. This is done by:

; The result of this is > 0 if the string matches
$regex(KEYLINE, string, keyLine)

; Loop through these results
$regml(KEYLINE, number)

The results:

Test string	Results
animal: cow .sound = moo	1 [animal], 2 [cow], 3 [sound], 4 [moo]
cowSound = moo	1 [cowSound], 2 [moo]
bird.wings = 2	1 [bird], 2 [wings], 3 [2]
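The results table can be reproduced with a short Python sketch, again treating re as a stand-in for $regex and $regml. One difference to keep in mind: Python reports a group that didn't take part in the match as None, whereas the table lists only the parts that matched, so the sketch filters the None entries out. The helper name matched_parts is made up for this example.

```python
import re

# keyLine as generated by the final regexFoo alias.
KEY_LINE = re.compile(r"^(?:(\w+)\s*:\s*)?(\w+)(?:\s*\.\s*(\w+))?\s*=\s*(.+)$")

def matched_parts(line):
    """Return the captured strings, skipping groups that did not take part."""
    match = KEY_LINE.match(line)
    if match is None:
        return []
    return [part for part in match.groups() if part is not None]

for line in ("animal: cow .sound = moo", "cowSound = moo", "bird.wings = 2"):
    print(line, "->", matched_parts(line))
```

In a real parser you would probably want to know which part is which (type, key, attribute, value) rather than just their order, so it's worth checking how your engine numbers the groups when an optional one is missing.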

What's next?

Now use those regex strings to make your own parser based on /fOpen, $fRead, and the rest of the f* functions. (Don't use $read -- it's terribly slow.)

You won't need to package the compiler with your script, heh. Keep it to yourself, then when you need to revise your format, edit your compiler's code and generate a new regex string.
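To give a feel for how the two regex strings slot into a line-by-line parser, here's a rough Python sketch of the loop. It is not the author's parser and not mIRC code: the nested-dictionary result, the handling of ";" comments, blank lines and "}" lines, and the "key.attribute" naming are all assumptions made for the example. It only covers the first half of the sample format (typed folders and ".attribute" shorthand lines are not handled).

```python
import re

# The two regex strings generated by the final regexFoo alias.
FOLDER_LINE = re.compile(r"^(\w+)\s*/\s*{$")
KEY_LINE = re.compile(r"^(?:(\w+)\s*:\s*)?(\w+)(?:\s*\.\s*(\w+))?\s*=\s*(.+)$")

def parse(text):
    """Parse the folder/key format into nested dicts (illustrative sketch only)."""
    root = {}
    stack = [root]                      # stack[-1] is the folder being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue                    # assumption: skip blank lines and ';' comments
        if line == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()                 # close the current folder
            continue
        folder = FOLDER_LINE.match(line)
        if folder:
            child = stack[-1].setdefault(folder.group(1), {})
            stack.append(child)         # open a nested folder
            continue
        key = KEY_LINE.match(line)
        if key is None:
            raise ValueError(f"cannot parse line: {raw!r}")
        type_name, name, attrib, value = key.groups()
        # assumption: store attributes as "key.attribute"; the type is ignored here
        stack[-1][name if attrib is None else f"{name}.{attrib}"] = value
    return root

SAMPLE = """\
away/ {
  ; A "key = value" pair
  logging = on
  defaultMessage = I'm flying away...
  font/ {
    name = Franklin Gothic Medium
    size = 14
  }
}
"""

print(parse(SAMPLE))
```

Notice that the loop body never touches the string-slicing functions: each line is either a folder line, a key line, a closing brace, or an error, and the regex groups hand over the pieces directly.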

Contact

Here: enf $chr(64) digitelone.com