(This is an advanced article. I assume you know regex and have done lots of coding.)
Have you ever tried making a parser yourself? Studying the string-parsing routines, then typing up long spaghetti code for hours, and finally coming up with a hideous soup of source code that you're sure not to understand the next day? Then the time comes when your project needs to expand and your parser needs to handle new tokens, and you end up trashing your mumbo-jumbo code, only to rewrite something twice as complex and twice as incomprehensible?
Insanely tedious and impractical, right?
Fear not! Whenever developers face tasks that are rigorous and mundane (like, for instance, making a parser), chances are other developers have already made solutions to ease your woes. This is, after all, how the computer programming world progresses -- by abstracting and simplifying things.
There are parser generators out there to do the job of writing the parsing code for you. Two of the most notable (in my opinion) are bison and yacc, both of which generate C code to parse files based on a grammar rule-set you write. I suggest you go look into these two wonderful projects, even if you don't plan on using them -- just to see how they work.
But alas, there is no such tool for mIRC! That's alright, we'll just have to make do with what we have, which actually works quite similarly to the parser tools I mentioned.
Enter regular expressions.
Interestingly enough, these are exactly what each and every parser code needs to be:
At first, I doubted the speed of a regex-based parser. Regex rule strings need to be parsed, after which they get juggled around by hundreds of lines of internal code. That sounded like a massive overhead.
I gave this some thought and came up with the following hypotheses:
If you read a file with /fOpen + $fRead and then crunch each line with $pos, $gettok, $mid, etc., mIRC's scripting engine will be wading through your entire script and evaluating each and every line of code you wrote. mIRC's regex parser is built from optimized C (?) code, which should clearly outperform a sluggish scripting engine trying to parse custom lines of code one by one.

Here's an example of a regex string from a parser that I wrote (which is what prompted me to write this article and share my experience with it):
^\s*(?:(\w+)\s*:\s*)?(\w+(?:/\w+)*)(?:\s*\.\s*(\w+))?\s*=\s*(?:(?!\s*{$)(.+))$
"Holy symbols, Batman! Just how can anyone possibly understand, manage, and even extend a line that looks like a hacker's leetspeak gone wrong?" you ask.
That's what this article is about. We're going to break this down into manageable, human-readable chunks.
The mIRC script I'm working on has files with a data structure that's akin to XML (tree structure). Let's make a parser for it. Here's a sample of the format we're gonna be working with:

    away/ {
      ; A "key = value" pair
      logging = on
      defaultMessage = I'm flying away...

      ; A folder (designated by the / postfix)
      font/ {
        name = Franklin Gothic Medium
        size = 14
      }
    }

    ; (Note that the first half of this article doesn't cover this next part.
    ; We'll tackle it in the 'Expanding' section.)
    dialog/ {
      ; With an attribute -- key.attribute = value
      ok.width = 40
      ok.height = 100
      ok.text = Okay! Alright!

      ; With a type -- type : key = value
      editbox : myEdit = foof;

      ; A folder with a type
      button: cancel/ {
        .width = 40
        .height = 90
        .text = Uh oh. Cancel.
      }
    }

For now, let's concentrate on the first part. We're gonna write a parser that treats this line-by-line. Traditionally (i.e., the method that I vehemently abhor), you might code one like this:

    %t = $fRead(...file,..)
    if ((*/ isWm %t) && ($chr(125) == $gettok(%t,-1,32))) {
      ..folder that has stuff (e.g., "font/")
      some more long code go here
      plus a bunch of nested if statements
    }
    elseif ((= isin %t) || ($left() %t) && (etc etc)) {
      .. a key = value pair (e.g., "logging = on")
      more tangled code go here
    }

Ugly. Let's do it this way.
We have 2 types of lines, a "folder" line and a "key=value" line. We know that they are comprised of:

    folderLine := folderName whiteSpaces '/' whiteSpaces '{'
    keyLine    := keyName whiteSpaces '=' whiteSpaces value

Now let's define the rest of the tokens:

    whiteSpaces := '\s*'   (Regex for zero or more whitespace characters)
    folderName  := name
    keyName     := name
    value       := '.+'    (Regex for 'match any non-empty string')
    name        := '\w+'   (One or more word characters: letters, digits, underscore)

Let's make a compiler for these kinds of lines. Note that this compiler is only going to be used by you, and is merely temporary -- its only purpose is to generate the regex strings we'll be using for our final parser.
Now let's turn those lines into mIRC lines:

    alias regexFoo {
      var %ws = \s*
      var %name = \w+
      var %folderName = %name
      var %keyName = %name
      var %value = .+
      var %folderLine = %folderName %ws / %ws $chr(123)
      var %keyLine = %keyName %ws = %ws %value

      ; Add ^ and $ to signify that we're matching one whole line
      %folderLine = ^ $+ $remove(%folderLine,$chr(32)) $+ $
      %keyLine = ^ $+ $remove(%keyLine,$chr(32)) $+ $

      ; Show the result
      echo -s * folderLine := %folderLine
      echo -s * keyLine := %keyLine
    }

Note that I'm using $remove(...,$chr(32)) to remove the spaces. That's because $+ is ugly and I don't want to use it, and we won't be using literal spaces anyway (our whitespace is \s). Running the alias (our little "compiler"), we get:

    * folderLine := ^\w+\s*/\s*{$
    * keyLine := ^\w+\s*=\s*.+$

Testing the keyLine regex against various strings with $regex, we get:
Test string | Result |
---|---|
cowSound = moo | OK |
cowSound=moo | OK |
cowSound = moo | OK |
oink | Fail |
= oink | Fail |
pigSound = | Fail |
Perfect!
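
If you want to rerun the table above yourself, a small throwaway alias does the job. This is only a sketch: the alias name testKeyLine is made up for illustration, and the pattern is simply the keyLine string our compiler printed.

```mirc
; Hypothetical test helper (the name is not part of the original script).
; Usage: /testKeyLine cowSound = moo
alias testKeyLine {
  ; The keyLine string generated by /regexFoo
  var %re = ^\w+\s*=\s*.+$

  ; $regex returns the number of matches, so 0 means the line was rejected
  if ($regex($1-,%re) > 0) { echo -s * OK: $1- }
  else { echo -s * Fail: $1- }
}
```

Keeping the pattern in a variable, rather than typing it straight into $regex, avoids having to worry about how the script engine reads the punctuation inside it.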
Time to expand. We've got a parser to tackle key=value pairs. But what about keys with attributes, i.e., key.attribute = value? Let's modify our code, mainly keyLine:

    ; Follow keyName and folderName.
    var %attribName = %name
    ...
    ; Wrap the attribute string in "(" and ")?"
    ; "?" means "match zero or one of these"
    ; Also note that the . is escaped with a backslash since the dot is a
    ; special regex character.
    var %keyLine = %keyName ( %ws \. %ws %attribName )? %ws = %ws %value

We include ( and )? to signify that it's optional. But the OC in me cringes at the sight of regex code that seems to be camouflaging itself in our string -- so, let's improve on this and make it more semantically nice. We are, after all, going for manageable and readable code:

    ; Look! I'm using "(?: ... )?" instead of just "()?".
    ; Notice the extra "?:". What does this mean? Nothing.
    ; ...Really. More on this later. Ignore it for now.
    alias -l OPTIONAL return (?: $+ $1- $+ )?
    ...
    ...
    var %keyLine = %keyName $OPTIONAL( %ws \. %ws %attribName ) %ws = %ws %value
    ...

Perfect. We just added attribute! Now for the optional type (i.e., [type:]key[.attribute] = value):

    ...
    ; Follow keyName and attribName.
    var %typeName = %name
    var %typeNameOptional = $OPTIONAL( %typeName %ws : %ws )
    ...
    var %keyLine = %typeNameOptional %keyName $OPTIONAL( %ws \. %ws %attribName ) %ws = %ws %value
    ...

Here's our assembled code so far:

    alias -l OPTIONAL return (?: $+ $1- $+ )?

    alias regexFoo {
      var %ws = \s*
      var %name = \w+
      var %folderName = %name
      var %keyName = %name
      var %attribName = %name
      var %typeName = %name
      var %value = .+
      var %typeNameOptional = $OPTIONAL( %typeName %ws : %ws )
      var %folderLine = %folderName %ws / %ws $chr(123)
      var %keyLine = %typeNameOptional %keyName $OPTIONAL( %ws \. %ws %attribName ) %ws = %ws %value

      %folderLine = ^ $+ $remove(%folderLine,$chr(32)) $+ $
      %keyLine = ^ $+ $remove(%keyLine,$chr(32)) $+ $

      echo -s * folderLine := %folderLine
      echo -s * keyLine := %keyLine
    }

Time to "compile".

    * folderLine := ^\w+\s*/\s*{$
    * keyLine := ^(?:\w+\s*:\s*)?\w+(?:\s*\.\s*\w+)?\s*=\s*.+$

Time to test keyLine. (Let's not worry about folderLine for now.)
Test string | Result |
---|---|
animal: cow = moo | OK |
animal:cow=moo | OK |
cow.sound = moo | OK |
cow . sound =moo | OK |
animal:cow.sound = moo | OK |
animal: pig = | Fail |
animal: = oink | Fail |
animal: .sound = baa | Fail |
Perfect! Just as expected.
Good, we got it to parse 'em strings. Now how are we going to fish out the parts we need (i.e., the type, key, attribute, and value)?
mIRC has $regml to find out what we matched. It does this by simply returning the exact string that matched a part of the expression enclosed in parentheses.
Parentheses is pronounced as paREN-thee-seez. Same with crises, nemeses, and the rest of the is->es plurals.
Back to mIRC scripting: let's modify our %folderName, %typeName, %keyName, %attribName, and %value variables:

    alias -l SIGNIF return ( $+ $1- $+ )
    ...
    var %folderName = $SIGNIF( %name )
    var %keyName = $SIGNIF( %name )
    var %attribName = $SIGNIF( %name )
    var %typeName = $SIGNIF( %name )
    var %value = $SIGNIF( .+ )
    ...

...With the little $SIGNIF alias there to aid in semantics, hehe.
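Now, why did we have the extra ?: in $OPTIONAL? A group that starts with (?: is non-capturing: it still groups the pattern, so the trailing ? applies to all of it, but $regml will skip that group. (This doesn't affect what $regex matches.)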
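
To see the difference for yourself, here is a minimal sketch that runs the same text through a capturing and a non-capturing version of the optional attribute group. The alias name groupDemo and the regex names CAP and GRP are made up for this illustration.

```mirc
; Hypothetical demo alias (not part of the compiler).
; Compares "( ... )?" with "(?: ... )?" on the same input.
alias groupDemo {
  var %text = cow.sound = moo

  ; Optional part written with plain "( ... )?" -- it captures
  var %capturing = ^(\w+)(\s*\.\s*(\w+))?\s*=\s*(.+)$
  ; Same part written with "(?: ... )?" -- it only groups
  var %grouping = ^(\w+)(?:\s*\.\s*(\w+))?\s*=\s*(.+)$

  var %n1 = $regex(CAP,%text,%capturing)
  echo -s * capturing: $regml(CAP,0) strings captured
  echo -s * capturing, second string: $regml(CAP,2)

  var %n2 = $regex(GRP,%text,%grouping)
  echo -s * non-capturing: $regml(GRP,0) strings captured
  echo -s * non-capturing, second string: $regml(GRP,2)
}
```

On the regex level, the first pattern has four capturing groups and its second one holds the whole ".sound" chunk, dot included. The second pattern has three, and its second group is just "sound", which is the part we actually care about.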
Here we go:

    alias -l OPTIONAL return (?: $+ $1- $+ )?
    alias -l SIGNIF return ( $+ $1- $+ )

    alias regexFoo {
      var %ws = \s*
      var %name = \w+
      var %folderName = $SIGNIF( %name )
      var %keyName = $SIGNIF( %name )
      var %attribName = $SIGNIF( %name )
      var %typeName = $SIGNIF( %name )
      var %value = $SIGNIF( .+ )
      var %typeNameOptional = $OPTIONAL( %typeName %ws : %ws )
      var %folderLine = %folderName %ws / %ws $chr(123)
      var %keyLine = %typeNameOptional %keyName $OPTIONAL( %ws \. %ws %attribName ) %ws = %ws %value

      %folderLine = ^ $+ $remove(%folderLine,$chr(32)) $+ $
      %keyLine = ^ $+ $remove(%keyLine,$chr(32)) $+ $

      echo -s * folderLine := %folderLine
      echo -s * keyLine := %keyLine
    }

Output:

    * folderLine := ^(\w+)\s*/\s*{$
    * keyLine := ^(?:(\w+)\s*:\s*)?(\w+)(?:\s*\.\s*(\w+))?\s*=\s*(.+)$

Further testing. This is done by:

    ; The result of this is > 0 if it runs right
    $regex(KEYLINE, string, keyLine)
    ; Loop through these results
    $regml(KEYLINE, number)

The results:
Test string | Results |
---|---|
animal: cow .sound = moo | 1 [animal], 2 [cow], 3 [sound], 4 [moo] |
cowSound = moo | 1 [cowSound], 2 [moo] |
bird.wings = 2 | 1 [bird], 2 [wings], 3 [2] |
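
The test loop behind the table above can be sketched as a small alias. The name dumpKeyLine is made up; the pattern is the keyLine string from the compiler output.

```mirc
; Hypothetical test helper (the name is not part of the original script).
; Usage: /dumpKeyLine animal: cow .sound = moo
alias dumpKeyLine {
  ; The keyLine string generated by /regexFoo
  var %re = ^(?:(\w+)\s*:\s*)?(\w+)(?:\s*\.\s*(\w+))?\s*=\s*(.+)$

  if ($regex(KEYLINE,$1-,%re) > 0) {
    ; $regml(name,0) is the number of captured strings
    var %i = 1
    while (%i <= $regml(KEYLINE,0)) {
      echo -s * %i = $regml(KEYLINE,%i)
      inc %i
    }
  }
  else { echo -s * No match: $1- }
}
```

Going by the table, $regml numbers only the groups that actually matched, so the position of the value shifts depending on whether a type or an attribute is present. Check how your mIRC version behaves before relying on fixed positions.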
Now use those regex strings to make your own parser based on /fOpen, $fRead, and the rest of the f* functions. (Don't use $read -- it's terribly slow.)
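
As a starting point, such a parser might look something like the skeleton below. Treat it as a sketch under assumptions: the alias name parseConfig, the file handle cfg, and the regex names FOLDER and KEY are all made up, it only echoes what it finds, and it ignores nesting, comments, and closing braces, which match neither pattern and are simply skipped.

```mirc
; Hypothetical skeleton (names are made up for illustration).
; Usage: /parseConfig settings.txt
alias parseConfig {
  ; Paste the strings generated by the compiler here.
  var %keyLine = ^(?:(\w+)\s*:\s*)?(\w+)(?:\s*\.\s*(\w+))?\s*=\s*(.+)$
  ; Same $chr(123) trick as in the compiler for the literal "{"
  var %folderLine = ^(\w+)\s*/\s* $+ $chr(123) $+ $

  if (!$isfile($1-)) {
    echo -s * No such file: $1-
    return
  }

  ; Close a handle left over from an earlier run, then open the file
  if ($fopen(cfg)) { fclose cfg }
  fopen cfg " $+ $1- $+ "

  while (!$fopen(cfg).eof) {
    var %line = $fread(cfg)
    ; Skip empty reads and blank lines
    if (%line != $null) {
      if ($regex(FOLDER,%line,%folderLine) > 0) {
        echo -s * folder: $regml(FOLDER,1)
      }
      elseif ($regex(KEY,%line,%keyLine) > 0) {
        echo -s * key line, $regml(KEY,0) parts, first is $regml(KEY,1)
      }
    }
  }
  fclose cfg
}
```

All the format knowledge lives in the two pattern strings, so when the format changes you regenerate them with the compiler and leave the loop alone.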
You won't need to package the compiler with your script, heh. Keep it to yourself, then when you need to revise your format, edit your compiler's code and generate a new regex string.