Tutorial - Advanced Regular Expressions

This tutorial aims to go one step further than the tutorial on regular expressions that already exists on this site. You are presumed to already be well versed in the basics of regular expressions, i.e. how to form them and use them to match patterns in strings. If you are new to this and haven't yet read Trystan's tutorial on regular expressions, please do so before continuing (you will only get confused otherwise): http://www.mircscripts.org/comments.php?id=989

Alright, I'll begin with back references (which also involve $regml). In the first tutorial, you came across parentheses which were used to separate parts of an expression. For example:

//echo -a $regex(this is a test,/this is a (test|balloon|sausage)/)

The expression looks for "this is a" followed by either "test" or "balloon" or "sausage", which is matched in the string and so a value of "1" will be returned as you know to tell us that the string matches the pattern expressed.

However, what you weren't told before was that enclosing an item in parentheses makes mIRC "remember" what was matched inside the parentheses. This is called a back reference. So in this case, mIRC stores the word "test" because that is what was matched in the string. You can refer to this value by using \N where N is the number of the back reference you want, which for this example would be \1 as (test|balloon|sausage) was the first back reference in the expression. However, \1 can only be used inside an expression, for example:

//echo -a $regex(this is a test,/this is a (test|balloon|sausage) \1/)

We know from the first example that the expression will match the "this is a test" part of the string, but when you stick the \1 in there, it tells mIRC to look for the first back reference. So since the first back reference value is "test", that expression looks for "this is a test test" in the string. This will return 0 though, because there is only one occurrence of "test" in the string and not a second one after it. On the other hand, $regex(this is a test test,/this is a (test|balloon|sausage) \1/) will return 1 because the expression matches "test" then \1 matches "test" again, so both words are matched. Here is an example of an appropriate use for this:

Suppose you want to match a string that looks like "I am smarter than you but Sigh is smarter than me" but the word "smarter" can be a variety of words, such as braver/cooler/stronger, provided that whichever word it is appears in both places. A regular expression that does this is:

/I am (smarter|braver|cooler|stronger) than you but Sigh is \1 than me/

As you can see, the \1 takes the first back referenced value and substitutes it in there so that regex will match "I am braver than you but Sigh is braver than me" etc. Similarly, if you have 2 items enclosed in parentheses you use \1 to refer to the first and \2 to refer to the second. If you want to refer to them outside of the expression, that is where $regml comes in.

The "ml" in $regml stands for "matched list" and its function is to remember the back referenced values in a regular expression. It's easy to use, just $regml(N) to get the Nth back referenced item (which inside the expression would be \N). Try the following:

//echo -a $regex(Hello world I am Sigh,/Hello world I (am|was|will be) Sigh/) - $regml(0) - $regml(1)

You will see "1 - 1 - am" echoed. The first "1" is returned by the call to $regex to say the expression matched the string successfully, the second "1" indicates there is 1 back referenced value (as with most identifiers that deal with an N parameter, $regml(0) returns the total amount) and the "am" is the first back referenced value.

To recap:

- Items enclosed in parentheses can be referenced within an expression as \N
- Outside of an expression they can be referenced by $regml(N)
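
To see both points of the recap in one go, here is a small sketch (the test sentence is just something made up for illustration). It uses \1 and \2 inside the expression and $regml afterwards:

//echo -a $regex(left right right left,/(left|right) (left|right) \2 \1/) - $regml(0) - $regml(1) - $regml(2)

This should echo "1 - 2 - left - right": the expression matched, there are 2 back referenced values, and they are "left" and "right" in that order.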

Let's have a couple of examples to further demonstrate their usage:

//.echo -q $regex(Sigh is 123 years of age and drives a red Ferrari or so he wishes,/Sigh is (\d+) years of age and drives a (\w+) Ferrari/) | echo -a $regml(0) back referenced values, first: $regml(1) - second: $regml(2)

It's simple; the \d+ in the first set of parentheses looks for one or more digits, which will grab the age. The \w+ in the second set will look for a run of word characters (so it stops at the next space) which retrieves the color of my Ferrari. Then these values are echoed in the next command. Of course, you can use $regml in an if statement, while loop etc. as it is a normal identifier and remembers the back referenced values from the last call to $regex. If you provide a name for a call to $regex (the optional "name" parameter of the identifier), you can later refer to that particular call with $regml(name,N), and mIRC can store 10 of these before they begin to get overwritten.
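
As a rough sketch of the named form, the following makes two separate calls to $regex and then reads both sets of values back by name. The names "age" and "car" are simply ones I picked for this example:

//.echo -q $regex(age,Sigh is 123,/(\d+)/) | .echo -q $regex(car,a red Ferrari,/a (\w+) Ferrari/) | echo -a $regml(age,1) - $regml(car,1)

This should echo "123 - red". Without the names, $regml(1) would only give you the value from the most recent call.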

//.echo -q $regex(I enjoy playing basketball because basketball helps me relax. I also enjoy football,/I (hate|enjoy) playing (basketball|tennis|chess) because \2 helps me relax\. I also \1 (football|hockey|IRC)/) | echo -a $regml(0) - $regml(1) - $regml(2) - $regml(3)

This is similar to what we were looking at before. \1 is referred to in there to get the value of the first back reference, it is then referred to outside the identifier as $regml(1), similarly \2 is used inside the expression and $regml(2) is used outside.

Now let's have a look at $regsub. If you understand what you have read so far, you already know how to use it. It works exactly the same as $regex except you are able to replace everything that your expression matches with specific text. At its most basic, it has a similar function to that of $replace and the only pain using it is that you must create a local variable in which to store the result of a substitution since $regsub() returns the number of substitutions made and not the final string with the substitutions. Let's look at the syntax:

$regsub([name], text, re, subtext, %var)

The [name] part is the same as that in $regex, it assigns a name to be used for back references with $regml and is also optional. The "text" part is the initial string containing whatever it is you want substitutions made in. "re" is the regular expression to use, "subtext" is the text to substitute in place of anything the expression matches and %var is the name of the variable (local or global) to dump the result in to. To experiment with it the basic method of viewing your substitutions is first a /var command followed by your echo: //var %temp | echo -a $regsub(...,%temp) - %temp. For example:

//var %temp | echo -a $regsub(string,/s/,b,%temp) - %temp

That echoes "1 - btring" where "1" is the number of substitutions made and %temp is the result of the substitution. Inside the identifier "string" is our string to start with, /s/ is our regular expression that matches a single letter "s" and "b" is the text to replace whatever is in the string that was matched by the regex with. %temp of course is the local variable we declared before echoing. This would have had the same effect as $replace(string,s,b) but only in this instance. $replace as you know replaces all occurrences of a substring but the regular expression we have used in this substitution only replaces the first instance of "s". So if you change "string" to "strings" you will see that %temp is filled with "btrings" and not "btringb".

This is because we have not specified the "g" switch in the expression, which would indicate to mIRC we want a global match (to match all occurrences of the pattern expressed in the regex and not just one). Try the same command with $regsub(strings,/s/g,b,%temp) and see what happens. Now, back references are possible in $regsub just as they are in $regex, and they can even be used in the "subtext" part of the identifier. Let's go through a couple of examples:

//var %temp | echo -a $regsub(I am Sigh and I am cool,/am/g,was,%temp) - %temp

Because the "g" switch was used, it replaces all parts of the string matching the regular expression which in this case is a word, "am". So the result is "I was Sigh and I was cool". 2 substitutions were made so the $regsub identifier returns 2.
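
Here is a quick sketch of a back reference used in the "subtext" part (the two words are just placeholders I chose). Each \w+ is captured, and the subtext puts them back in the opposite order:

//var %temp | echo -a $regsub(Sigh Ferrari,/(\w+) (\w+)/,\2 \1,%temp) - %temp

This should echo "1 - Ferrari Sigh", one substitution in which the whole match was replaced by the second value, a space, then the first value.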

Let's look at something more complicated such as removing HTML tags from text. First I must tell you what ^ in a character class represents. A character class is what you came across in the first tutorial, a group of characters enclosed in square brackets that could include ranges (such as [a-zA-Z0-9]). A class such as [a] matches the letter "a" and is the same as matching "a" outside a class. However, if you use the ^ character after the opening bracket like [^a] that matches any character that is not an "a".

i.e.
$regex(b,[^b]) returns 0 since there is no single character in "b" that isn't a "b".
$regex(a,[^b]) returns 1 because the character "a" is matched by the class [^b] because it isn't "b".
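
The same works with ranges. As a small sketch (the test strings are made up), this checks whether a string contains anything other than lowercase letters:

//echo -a $regex(abc123,/[^a-z]/) - $regex(abc,/[^a-z]/)

This should echo "1 - 0", since "abc123" contains characters outside a-z (the digits) and "abc" does not.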

This is important in thinking of a regular expression to match HTML tags so they may be removed (by removed I mean use $regsub to substitute HTML tags with $null). The first thing to do in this case is to think "What regular expression can I use to match an HTML tag?" and build the expression from there.

You know any HTML tag begins with a <, so that is the first part of the regex. That < is followed by one or more characters followed by an ending >. At a first try, your expression may end up looking like this:

/<.+>/g

You can check this quickly like so:

//var %temp | echo -a $regsub(<tag>Text</tag>,/<.+>/g,-,%temp) - %temp

So we hope to replace every HTML tag with a hyphen for testing purposes. As you can see if you type this, it replaces the whole string with a single hyphen. This is because the expression <.+> matches < followed by one or more characters followed by > but mIRC tries to match as many characters as possible with .+ so it ends up matching everything from the first < to the last >. We don't want this. We only want mIRC to match up until the next > which is the end of the corresponding HTML tag. So instead of .+ to match characters within an HTML tag, it would be more applicable to use [^<>]+ which is a character class representing any character except < or >. An alternative is the shorter .+? where the question mark calls for a non-greedy match i.e. mIRC will try to match as few characters as possible. Our problem is solved:

//var %temp | echo -a $regsub(<tag>Text</tag>,/<[^<>]+>/g,-,%temp) - %temp

-Text- is echoed showing us that the two tags have successfully been replaced with -. All that needs to be done to remove them is change the subtext part of $regsub to $null or leave it empty.
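
For comparison, here is a sketch of the two variations just mentioned: the non-greedy .+? form, and the character class form with the subtext left empty so the tags are removed outright:

//var %temp | echo -a $regsub(<tag>Text</tag>,/<.+?>/g,-,%temp) - %temp
//var %temp | echo -a $regsub(<tag>Text</tag>,/<[^<>]+>/g,,%temp) - %temp

The first should echo "2 - -Text-" and the second "2 - Text", with 2 substitutions made in each case.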

Let's recap:

- $regsub works like $regex but substitutes the text you specify into any area that is matched by the regular expression you give it
- \N can be used both in the expression and in the subtext (eg. $regsub(Sigh is cool,/(is|was)/,\1,%var)) to put the first back referenced value in the substituted text
- Using ^ at the beginning of a character class will negate it i.e. [^A-Z] matches any character that is not capital letters A to Z.
- When building a regex substitution you may find it more comfortable to consider the expression to use first, thinking about what needs to be matched in order to substitute the text in the correct places.

Now hopefully you are comfortable with $regex and $regsub and have experimented with both. The key to mastering this is practice and experimentation to see what can and cannot be done. Challenge yourself by thinking of patterns to match and try to express them with a regular expression. Let's move on to a slightly more advanced component of regex.

You have already come across escapes such as \w \s \d etc., each of which matches and consumes one character. What I am going to talk about now are assertions written as sub patterns. These do not consume any characters, meaning they simply take a quick look at what is ahead of something or what is behind something (so for example in $regsub the text they look at is not part of the match and so it will not be substituted). An assertion that looks like (?<=) is a positive look behind assertion, (?=) is a positive look ahead assertion, (?<!) is a negative look behind assertion and (?!) is a negative look ahead assertion. I will also cover (?:) here, which is not an assertion but a non-capturing group.

I'll begin with (?:) as it is easiest to comprehend. A good way to consider its use is to think of a regular expression in which you use parentheses to separate sub patterns but do not refer to them via \N or $regml. mIRC will store them for you anyway temporarily, but if you don't need it to be stored you can stop mIRC from capturing it using (?:) in place of (). For example:

if ($regex($1-,\d\.(\w|\d|\s)+)) { commands }

Let's suppose you don't refer to the word character, digit or space matched in (\w|\d|\s) at all in the if statement. mIRC captures it for you anyway because you enclose it in parentheses. So it is a good idea to use a non-capturing group inside your expression:

if ($regex($1-,/\d\.(?:\w|\d|\s)+/)) { commands }

It doesn't change anything related to what you can or cannot include inside the parentheses. The only difference is that you may not use \1 later in the expression to refer to the value matched by the group, neither can you use $regml(1) later to reference what was matched.
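
You can see the effect on $regml with a small sketch like this one (the test sentence is made up). Only the second set of parentheses is a capturing one:

//echo -a $regex(I was Sigh,/I (?:am|was) (\w+)/) - $regml(0) - $regml(1)

This should echo "1 - 1 - Sigh". The (?:am|was) part still has to match, but it is not counted or stored, so the first back referenced value is "Sigh".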

The (?<=) is a positive look behind assertion, positive meaning it succeeds only if the sub pattern is found, whereas a negative one succeeds only if the sub pattern is not found. The look behind means that it looks behind the current matching point for the specified expression. If you consider the parser to move through each character in the string one by one, and it is currently on the letter "g" in "Sigh", using a look behind assertion will not affect the position of the matching point but only take a peek at what is behind the "g". For example:

//echo -a $regex(Testing assertions,/(?<=Testing)\sassertions/)

This matches [space]assertions only when it comes directly after the word "Testing". You may ask how this is different to simply using /Testing\sassertions/. Let's think of it in terms of the matching point. For the example shown, the match begins at the space. It looks behind for "Testing", taking only a look at what is behind it without consuming it, then attempts to match what comes after. However, when you use /Testing\sassertions/ the match begins at the start of the string at "T".
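
A quick way to see where the match actually begins is to swap $regex for $regsub, as in this sketch:

//var %temp | echo -a $regsub(Testing assertions,/(?<=Testing)\sassertions/,*,%temp) - %temp

This should echo "1 - Testing*". The word "Testing" was only looked at, so it is left alone and just the space and "assertions" are replaced.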

The next assertion is (?=), a positive look ahead assertion. It is used in the same way as the previous one, only it takes a look at what comes after the matching point. Here is an example:

//echo -a $regex(Testing assertions,/Test(?=ing)/)

This looks for "Test" followed by "ing" and matches the string. Again this may appear to be the same as simply using /Testing/. The difference in this instance can be seen by using $regsub. Try the following commands:

//var %temp | echo -a $regsub(Testing,/Test(?=ing)/,*,%temp) - %temp
//var %temp | echo -a $regsub(Testing,/Testing/,*,%temp) - %temp

Although for both of those only one substitution was made, the first only substituted "Test" whereas the second substituted "Testing". If you consider the regular expression used in the second call to $regsub, it matches the whole string. But as I said before, these assertions do not consume any characters, so what they look at is not considered to be part of the text matched in the substitution. Test(?=ing) matches "Test" but only looks ahead for the "ing" so it is not substituted.

The next assertion to look at is (?<!), a negative look behind assertion. Relate this to our positive look behind assertion, the only difference being that the negative one succeeds only when the sub pattern is not found behind the matching point. [a] is to [^a] as (?<=a) is to (?<!a). This is used when you have something to match but it should only be matched if it isn't preceded by something else. In this case you may have something like:

//echo -a $regex(I am Sigh,/(?<!am\s)Sigh/)

So as soon as the matching point lands on "S" it takes a quick look behind to make sure there is not an occurrence of the sub pattern inside, which is the word "am" followed by a space. Unfortunately, there is an occurrence of this behind the S, so the identifier returns 0. Had it been /(?<!was\s)Sigh/ then 1 would have been returned to indicate that "Sigh" was found in the string and it wasn't preceded by "was ".
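
Here is a sketch of the same idea in $regsub, using a made-up string. It replaces "cat" only where it is not preceded by "bob":

//var %temp | echo -a $regsub(cat bobcat cat,/(?<!bob)cat/g,dog,%temp) - %temp

This should echo "2 - dog bobcat dog". The "cat" inside "bobcat" is skipped because the look behind finds "bob" there.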

Finally we come to (?!) which is a negative look ahead assertion. By now you should be able to identify that this looks ahead of the matching point and doesn't match what is expressed as the sub pattern. It goes without saying that [a] is to [^a] what (?=a) is to (?!a). So:

//echo -a $regex(I am still Sigh,/still(?!\sSigh)/)

This returns a value of "0" because although "still" does exist in the string, it is followed by a space and "Sigh". The negative assertion only succeeds when its sub pattern is not found ahead of the matching point, so here it fails and the whole expression fails with it.
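
Again a $regsub sketch makes this easier to see (the string is made up). It replaces "still" only where it is not followed by a space and "Sigh":

//var %temp | echo -a $regsub(still Sigh and still here,/still(?!\sSigh)/g,yet,%temp) - %temp

This should echo "1 - still Sigh and yet here". The first "still" is left alone because the look ahead finds " Sigh" after it.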

As I explained before, these assertions do not consume any characters. The look behind and look ahead assertions are ideal for use in $regsub because of this, where you may not want to substitute what comes before or after a certain pattern but merely want to take a little look ahead or behind for something. Let's recap:

- (?:) is a non-capturing group (not an assertion) that should be used to separate sub patterns that you don't need to back reference with \N or $regml
- (?<=) is a positive look behind assertion to take a small look behind the matching point for the specific sub pattern
- (?=) is a positive look ahead assertion to take a small look ahead of the matching point for the sub pattern
- (?<!) is a negative look behind assertion which is similar to its positive counterpart, only it succeeds when the sub pattern is not found
- (?!) is a negative look ahead assertion, similar to positive look ahead, only it succeeds when the sub pattern is not found
- These are ideal in $regsub where you may not want to substitute what comes after or before a pattern

I know I haven't included any enormous or complex regular expressions in this document. This is because these kinds of expressions aren't really complex, and if you pick them apart bit by bit you can very easily see how they work. I hope this tutorial has made some things clear for you, although bits of it will undoubtedly have confused you further than you already might have been, for which I apologize. The key really is practice, and setting yourself challenges is always good fun. I know there are certain things I have left out. My excuse for this laziness is that I have left it up to you to go away and discover these things by yourself (a good learning experience perhaps). Good luck, and please don't hesitate to ask any questions or to request anything that I should go over in more depth.

Also, have a look through http://www.pcre.org/man.txt as it is really the bible for regular expressions in mIRC. It contains some information that arguably goes into too much technical detail but otherwise there is a lot that can be learnt from it.

* The End *