Tutorial - Hashing Tutorial

Introduction

Most scripters already know that using hash tables is faster than using other methods of storing information, like using .ini files. A lot of people, though, don't know why exactly hash tables are faster.

Part of their speed comes from their residence in memory, while .inis--unless cached-- reside on the hard disk. Operations that deal with reading from or writing to disk are faster than operations that deal with memory.

But that's not all of it. If you had a cached .ini file and pitted it against a hash table that contained the same information (using item names that combine section and item in .inis), hash tables would still be faster on average.

Why is this? It comes from the way a hash table is set up; it's a data structure, and not simply a file that gets parsed every time you need to access it.

Linear Search vs. Hashing

If you had to write an alias that does what /writeini and $readini do, you'd probably do it the same way I would: /writeini would search through every line of the file, and overwrite the appropriate line when it found it. At the end, if it didn't find the line, it'd add a new one under the right section. $readini would scan through every line, and return the right line under the right section. If it didn't find anything to return, it'd stop and return $null.

This is an example of linear searching, so called because the amount of time it takes to execute is based (roughly) on the amount of lines in the file. If you had to graph execution time, you'd get a line.

Now, this isn't a problem if you have small files or a lot of time. But as files get bigger, writing and reading take longer, especially if the section is at the end of the file.

So, if hash tables don't use linear search, what do they use?

They use hashing. [Note: this has nothing to do with $hash().]

What is hashing?

The best way to describe hashing is to imagine a bunch of buckets. Say that we have five buckets, numbered 0 to 4, each filled with index cards. On the front, the index cards have a phone number; on the back, they have a name.

Now saw we want to find out who has the number "8675309". We could either do a linear search, looking through each bucket until we find the right card, or we could cheat.

Before we set up these buckets for holding phone numbers, we establish a system that tells us where to put certain phone numbers. We create a hash function that gives us a value based on the phone number. We then divide that number by 5 (the number of buckets) and take the remainder. This way, we know what bucket to look into, based on the number that we're looking for.

But wait a second. Why are we taking the remainder after dividing by 5 (or simply f(x) % 5, where f(x) is our hash function)? This makes it easier to use our hash function for phone numbers later; we have it return simply a value, not a bucket number. This way, if we wanted to add a sixth bucket, we wouldn't have to change the function, just change the divisor.

So now, before we add an index card to our buckets, we have to figure out which bucket it goes into. And when we want to take an index card out, we have to do the same thing.

But it's easier to search through one bucket instead of five. Imagine if "8675309" was in the fifth bucket, at the bottom. We wasted all that time, when we could have just set up this hashing system.

Hash Functions

Now, if we want to implement this phone-numbers-in-a-bucket system, we really ought to figure out what hash function we want to use. We could add the first and last number of the phone number, or we could add the the middle three, or we could just return the middle number. These are decent hash functions.

A bad idea is returning its length (this applies only if phone numbers in your home country don't have differing lengths--I realize some places like the United Kingdom have differing lengths based on regions' populations, which is more logical than United States' numbers, but that's not the point of this tutorial). A worse idea is returning a value based on the time, or the date, or something else like that.

Why? Because, we want to get the same value out of our function at all times. If your function returns $ctime, we will search in the wrong bucket most of the time.

Pulling It All Together

Let's assume that we want to script a "replacement" for hash tables in mIRC.

We establish our hash function to be the sum of the first and last numbers in the phone number.

How do we set up our buckets? As messy as it would be, I would suggest hidden @windows. The way I've seen them done in "real" languages is through using an array of queues, but this is the simplest way to implement hash tables in mIRC.

So how does our replacement work?

On /hmake N, we create N number of buckets. (I bet some of you have wondered why exactly we have to specify a number of "slots." mIRC allows you to vary the number of buckets your table contains.)

On /hadd , we compute the bucket number, and /aline the item and value to the @window. On /hdel , we find the right bucket, search through the lines in the @window, and delete the right line.

On $hget(,), we find the bucket number, find the right line, and return it.

The other functions aren't completely necessary, but I'm sure you could figure out a way to implement them as well.

Conclusion

Hopefully, after reading this you've come to see how cheating and knowing where to look for data is a lot faster than searching through an entire file (or another linear data structure). Even though the last search is linear, searching through ten items out of one-hundred is faster than searching through all one-hundred.