frozen-beak/hash_table

#️⃣ Hash Table

A HashTable (or HashMap) is a key-value data structure. It works by mapping keys to values using a hash function. HashMaps have an average time complexity of O(1) for insertion, deletion, and lookup.

Under the hood, a HashMap uses an array to store its values. The difference is that we access values by key rather than by index: the hash function turns the key into an array index, which is what makes lookups fast.

The index at which an item (both key and value) is stored is computed by a hash function; this index is also referred to as the hash code.

Hash Function

A hash function takes a key as input and outputs a number. That output is our hash code, the index at which the item will be stored in the HashTable.

Hash functions are deterministic: given the same input, they always produce the same output.

The output of a hash function should always fall within a certain range, typically from 0 up to the size of the underlying array.
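
As a sketch, the usual way to keep the output in range is to take the raw hash modulo the array's capacity (the function name here is illustrative, not the repo's):

```rust
// Clamp an arbitrary hash code into the table's index range.
// `capacity` is the length of the underlying array.
fn index_for(hash: u64, capacity: usize) -> usize {
    (hash as usize) % capacity
}

fn main() {
    // Whatever the raw hash is, the index always lands in 0..8.
    let index = index_for(5863208, 8);
    assert!(index < 8);
    println!("{index}");
}
```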

Collisions

Because a key can be any string while the hash function returns a value in a fixed range, the hash function can return the same output for different inputs. These are called collisions.

For example, if we wrote a hash function that returns output between 0 and 7 and we pass it 9 unique inputs, we're guaranteed at least 1 collision.

hash("to")         == 3
hash("the")        == 2
hash("café")       == 0
hash("de")         == 6
hash("versailles") == 4
hash("for")        == 5
hash("coffee")     == 0
hash("we")         == 7
hash("had")        == 1

We have two ways to deal with collisions —

  1. Separate Chaining

    Here we use a linked list to handle collisions. If keys collide at a certain index, we store a linked list at that index holding all of the colliding elements.

  2. Open Addressing

    Here, if a collision occurs, we probe for another slot, e.g. index + 1 (linear probing), and store the incoming item there. With open addressing we also have to keep resizing the underlying array to make room for upcoming items.
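
As a rough sketch of separate chaining (the types and names here are illustrative, not necessarily the repo's), each slot of the array can hold a vector of key-value pairs, so colliding keys simply share a bucket:

```rust
// Separate chaining sketch: each bucket is a Vec of (key, value) pairs.
struct ChainedTable {
    buckets: Vec<Vec<(String, i32)>>,
}

impl ChainedTable {
    fn new(capacity: usize) -> Self {
        ChainedTable { buckets: vec![Vec::new(); capacity] }
    }

    // djb2, wrapped into the bucket range.
    fn index(&self, key: &str) -> usize {
        let mut hash: u64 = 5381;
        for byte in key.bytes() {
            hash = hash.wrapping_mul(33).wrapping_add(byte as u64);
        }
        (hash as usize) % self.buckets.len()
    }

    fn insert(&mut self, key: &str, value: i32) {
        let i = self.index(key);
        self.buckets[i].push((key.to_string(), value));
    }

    fn get(&self, key: &str) -> Option<i32> {
        let i = self.index(key);
        self.buckets[i].iter().find(|(k, _)| k == key).map(|(_, v)| *v)
    }
}

fn main() {
    let mut table = ChainedTable::new(2); // tiny capacity forces collisions
    table.insert("to", 1);
    table.insert("the", 2);
    table.insert("had", 3);
    // Colliding keys coexist in the same bucket and are still retrievable.
    assert_eq!(table.get("the"), Some(2));
    assert_eq!(table.get("had"), Some(3));
    println!("ok");
}
```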

Creating a Hash Function

To create our hash function we can use the djb2 algorithm, created by Daniel J. Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

The algorithm in detail —

  1. The hash value is initialized to 5381. This number was chosen by the author after some empirical testing; it's not necessarily special beyond serving as a good starting point. As the rule says, "If something works, don't change it!".

  2. Then we loop through every character from the input string.

  3. Inside the loop, on every iteration we update our hash value with the following formula,

    $$ hash = ((hash \ll 5) + hash) + c $$

    • We first left shift the current hash value by 5 bits, which is equivalent to multiplying the value by 32, i.e. 2^5.

      For example, 5 is represented as 101 in binary. Left shifting it by 5 bits (which appends five zeros at the end) gives 10100000, which is binary for 160. 5 * 32 gives exactly the same value.

    • After shifting, we add the original hash value back to the shifted value, making the whole step equivalent to hash * 33.

    • Finally, the ASCII value of the current character is added, and the result becomes the new hash value.

Important

In early computing, multiplication was generally more computationally expensive than bitwise operations. Bitwise shifts and additions were faster and required less hardware, so directly multiplying by 33 was slower than a left shift plus an addition.

Tip

Modern compilers, however, are very good at optimizing code. They recognize that hash * 33 can be computed with a left shift and an addition, so there's often no performance difference on modern systems.

For example, if our key is “ab”, our loop does the following —

ASCII value of a is 97 and for b is 98.

Note

ASCII, or American Standard Code for Information Interchange, is a character encoding standard used for electronic communication. It assigns numeric values to letters, numerals, punctuation marks, and other characters, allowing computers to represent and manipulate text data effectively.

  • For a, we left shift our current hash value, which is 5381, i.e. 5381 << 5 = 172192, then we add the original hash, which was 5381, and then we add 97 (ASCII code for char a). We get 177670.
  • For b, we left shift our current hash value, which is 177670, i.e. 177670 << 5 = 5685440, then we add the original hash, which was 177670, and then we add 98 (ASCII code for char b). We get 5863208.
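
The C function above can be ported to Rust as a small sketch. The wrapping_add calls mirror the silent wrap-around of C's unsigned arithmetic once longer keys would overflow a u64:

```rust
// djb2: start at 5381, then hash = hash * 33 + byte for every byte,
// written in the shift-and-add form used above.
fn djb2(key: &str) -> u64 {
    let mut hash: u64 = 5381;
    for byte in key.bytes() {
        hash = ((hash << 5).wrapping_add(hash)).wrapping_add(byte as u64);
    }
    hash
}

fn main() {
    assert_eq!(djb2(""), 5381);      // no characters: the seed is returned
    assert_eq!(djb2("ab"), 5863208); // 5381 -> 177670 -> 5863208
    println!("{}", djb2("ab"));
}
```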

Load Factor

When initializing a HashTable, we start with a fixed size. To manage collisions and maintain efficient performance, we use a load factor to determine when to resize the table. Specifically, we extend the size of our hash table when it reaches a certain capacity.

For example, with a load factor of 75%, we extend the table once it is 75% full.

Important

But why does it matter?

If our table fills beyond the load factor (generally 75%), the likelihood of collisions increases, degrading performance. By resizing the table we reduce the probability of collisions and maintain efficient operation.
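
A minimal sketch of that check (the names are illustrative, not the repo's): before each insert, compare the fill ratio against the load factor and grow once it's reached:

```rust
// Grow the backing array once the table is 75% full.
const LOAD_FACTOR: f64 = 0.75;

fn needs_resize(len: usize, capacity: usize) -> bool {
    (len as f64) / (capacity as f64) >= LOAD_FACTOR
}

fn main() {
    assert!(!needs_resize(5, 8)); // 62.5% full: keep inserting
    assert!(needs_resize(6, 8));  // 75% full: time to grow (e.g. double capacity)
    println!("ok");
}
```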

Clustering

Since we are using open addressing with linear probing to resolve collisions, clusters may form in our HashTable.

Clustering in a HashTable refers to a situation where multiple keys hash to nearby slots, causing consecutive occupied elements to form a cluster in the table.

Visualization of clustering in HashTable created by the amazing Sam Rose

This makes it harder to find empty slots for new insertions and increases the time taken by insertion operations.

Imagine a hash table where the following elements are inserted —

  • Insert key A at index 2.
  • Insert key B, which hashes to index 2 (collision), so it's placed at index 3.
  • Insert key C, which hashes to index 3 (another collision), so it's placed at index 4.

Now, we have a small "cluster" from index 2 to 4. Every new insertion or search that hashes near this range will likely end up in this cluster, causing more probing and slowing down operations.

To avoid clustering we can —

  1. Quadratic Probing

    Here, if a collision occurs at index, then instead of looking at the next slot (index + 1), the probing distance grows quadratically (1, 4, 9, etc.), i.e. index + 1, index + 4, index + 9, and so on. This reduces the likelihood of clustering.

  2. Double Hashing

    Here we use a second hash function to determine the step size used to calculate the next index for the element. This spreads colliding keys apart, making it less likely for a cluster to form.
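
A sketch of the quadratic probe sequence (illustrative names): on the i-th attempt after a collision at a key's home slot, we look at (home + i²) mod capacity instead of stepping one slot at a time:

```rust
// Quadratic probing: the i-th probe lands at (home + i*i) % capacity.
fn probe(home: usize, attempt: usize, capacity: usize) -> usize {
    (home + attempt * attempt) % capacity
}

fn main() {
    let capacity = 16;
    // Colliding at index 2, successive probes jump 2, 3, 6, 11 ...
    let probes: Vec<usize> = (0..4).map(|i| probe(2, i, capacity)).collect();
    assert_eq!(probes, vec![2, 3, 6, 11]);
    println!("{:?}", probes);
}
```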

Overflow In Rust

In Rust, overflow occurs when an arithmetic operation produces a value that is too large to fit into the type being used to store it. This happens when the result exceeds the maximum value that the type can represent.

let mut a: u8 = 250;
a = a + 10;

Here the type u8 can store values in the range 0 to 2^8 - 1, i.e. 0 - 255, so if we add 10 to 250 the result should be 260. However, since 260 is out of u8's range, the operation overflows and the program panics.

Note

Our program will only panic in debug mode, to alert us to this bug. In release mode the value is automatically wrapped, leaving a with the value 4. How? — (250 + 10) % 256 = 4
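
When wrap-around (or another overflow policy) is what you actually want, Rust's integer types let you ask for it explicitly; these are standard std methods:

```rust
fn main() {
    let a: u8 = 250;
    assert_eq!(a.wrapping_add(10), 4);     // wraps like release mode: (250 + 10) % 256
    assert_eq!(a.checked_add(10), None);   // reports overflow as None instead of panicking
    assert_eq!(a.saturating_add(10), 255); // clamps at u8::MAX
    println!("ok");
}
```

Using wrapping arithmetic is also how a Rust djb2 stays panic-free in debug builds when hashing long keys.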
