A HashTable or a HashMap is a key-value based data structure. It works by mapping keys to values using a hash function. HashMaps have an average time complexity of O(1) for insertion, deletion and fetching.
Under the hood they use arrays to store the values. The difference is that we access values based on the key and not the index, which is what makes lookups fast.
The index at which an item (both key & value) is stored is calculated by a hash function, and this index is also referred to as the hash code.
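For e.g. Rust ships exactly such a structure in its standard library as `std::collections::HashMap` (the keys and values below are just illustrative):

```rust
use std::collections::HashMap;

fn main() {
    let mut ages: HashMap<String, u32> = HashMap::new();

    // Insertion, lookup and deletion are all O(1) on average.
    ages.insert(String::from("alice"), 30);
    ages.insert(String::from("bob"), 25);

    // We fetch by key, not by positional index.
    if let Some(age) = ages.get("alice") {
        println!("alice is {age}");
    }

    ages.remove("bob");
}
```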
A hash function is a function that takes the key as an input and outputs a number. The output is our hash code, i.e. the index at which our item will be stored in the HashTable.
Hash functions are deterministic in nature, which simply means that when provided with the same input multiple times, they produce the same output!
The output generated by a hash function should always be in a certain range, typically within the size of the underlying array.
Because the key can be any string and the hash function returns a value in a specific range, there is a possibility that the hash function will return the same output for different inputs; these are called collisions.
For e.g. if we wrote a hash function to return output between 0 and 7 and we pass it 9 unique inputs, we're guaranteed at least 1 collision.
hash("to") == 3
hash("the") == 2
hash("café") == 0
hash("de") == 6
hash("versailles") == 4
hash("for") == 5
hash("coffee") == 0
hash("we") == 7
hash("had") == 1
We have two ways to deal with collisions (a rough sketch of both follows this list):
- Separate Chaining: In this we use a linked list to handle the collisions. If keys collide at a certain index, we store a linked list at that index containing the colliding elements.
- Open Addressing: In this, if a collision occurs, we store the incoming item at the next free slot, starting at `index + 1` (this is known as linear probing). In open addressing, we have to keep resizing the underlying array to make room for upcoming items.
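Here's what the two strategies can look like; the `ChainedTable` and `OpenTable` types are hypothetical, and a `Vec` stands in for the linked list to keep the sketch short:

```rust
// Separate chaining: every slot holds a list of colliding entries.
#[allow(dead_code)]
struct ChainedTable {
    slots: Vec<Vec<(String, String)>>,
}

// Open addressing: every slot holds at most one entry; on a collision
// we probe forward until we find an empty slot.
struct OpenTable {
    slots: Vec<Option<(String, String)>>,
}

impl OpenTable {
    // `hash` is assumed to be precomputed for the key.
    fn insert(&mut self, key: String, value: String, hash: usize) {
        let cap = self.slots.len();
        let mut index = hash % cap;
        // Linear probing: step to the next slot until one is free.
        // (Assumes the table is never completely full.)
        while self.slots[index].is_some() {
            index = (index + 1) % cap;
        }
        self.slots[index] = Some((key, value));
    }
}

fn main() {
    let mut table = OpenTable { slots: vec![None; 8] };
    // Two keys whose hashes collide at index 0: the second probes to index 1.
    table.insert("café".into(), "latte".into(), 0);
    table.insert("coffee".into(), "espresso".into(), 8); // 8 % 8 == 0
    assert!(table.slots[0].is_some() && table.slots[1].is_some());
}
```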
To create our hash function we can use the djb2 algorithm, created by Daniel J. Bernstein.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
The above algorithm in detail:
- The hash value is initialized with `5381`. This number was chosen by the creator after some empirical testing, but it's not necessarily special beyond serving as a good starting point. As the rule says, "If something works, don't change it!"
- Then we loop through every character of the input string.
- Inside the loop, on every iteration we update our hash value with the following formula: $$ hash = ((hash << 5) + hash) + c; $$
- We first left shift the current hash value by 5 bits, which is equivalent to multiplying the value by 32, i.e. 2^5. For e.g. `5` is represented as `101` in binary. If we left shift this by 5 bits (which basically means adding 5 zeros at the end), we get `10100000`, which is binary for `160`. If we do `5 * 32` we get exactly the same value.
- After shifting, we add the original hash value back to the shifted value, which is equivalent to `hash * 33`.
- Finally, the ASCII value of the current character is added, and the calculated value is stored back into `hash`.
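We can quickly verify the shift-and-add equivalence ourselves; this is just a sanity check, not part of the algorithm:

```rust
fn main() {
    let hash: u64 = 5381;

    // Left shifting by 5 bits multiplies by 2^5 = 32.
    assert_eq!(5u64 << 5, 5 * 32); // both are 160

    // So (hash << 5) + hash is hash * 32 + hash, i.e. hash * 33.
    assert_eq!((hash << 5) + hash, hash * 33);
}
```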
Important
In early computing, multiplication was generally more computationally expensive than bitwise operations. Bitwise shifts and additions were faster and required less hardware, so directly multiplying by 33 was slower than a left shift plus an addition.
Tip
But modern compilers are very good at optimizing code. They might even recognize that `hash * 33` can be optimized into a left shift and addition, so there's often no performance difference on modern systems.
For e.g. if our key is “ab”, in our loop we do the following. The ASCII value of `a` is 97 and of `b` is 98.
Note
ASCII, or American Standard Code for Information Interchange, is a character encoding standard used for electronic communication. It assigns numeric values to letters, numerals, punctuation marks, and other characters, allowing computers to represent and manipulate text data effectively.
- For `a`, we left shift our current hash value, which is `5381`, i.e. `5381 << 5 = 172192`, then we add the original hash, which was `5381`, and then we add `97` (the ASCII code for char a). We get `177670`.
- For `b`, we left shift our current hash value, which is `177670`, i.e. `177670 << 5 = 5685440`, then we add the original hash, which was `177670`, and then we add `98` (the ASCII code for char b). We get `5863208`.
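We can replay both iterations in a few lines to confirm the numbers above, using `u64` so the intermediate values comfortably fit:

```rust
fn main() {
    let mut hash: u64 = 5381;

    // Iteration for 'a' (ASCII 97).
    hash = ((hash << 5) + hash) + 97;
    assert_eq!(hash, 177_670);

    // Iteration for 'b' (ASCII 98).
    hash = ((hash << 5) + hash) + 98;
    assert_eq!(hash, 5_863_208);
}
```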
When initializing a HashTable, we start with a fixed size. To manage collisions and maintain efficient performance, we use a load factor to determine when to resize the table. Specifically, we extend the size of our hash table when it reaches a certain capacity.
For e.g. if our hash table has a load factor of 75%, then we will extend the size of our table when it becomes 75% full.
Important
But why does it matter?
If our table exceeds the load factor (generally 75%), the likelihood of collisions increases, degrading performance. By resizing our hash table we reduce the probability of collisions and maintain efficient operation speeds.
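A minimal sketch of that check; the `HashTable` shape and the `maybe_resize` helper here are purely illustrative:

```rust
const LOAD_FACTOR: f64 = 0.75;

struct HashTable {
    slots: Vec<Option<(String, String)>>,
    len: usize, // number of occupied slots
}

impl HashTable {
    // Called after every insertion: grow once we cross the load factor.
    fn maybe_resize(&mut self) {
        let capacity = self.slots.len();
        if (self.len as f64) / (capacity as f64) > LOAD_FACTOR {
            // Double the capacity. Every existing entry must be re-hashed,
            // because indices depend on the table size.
            let mut new_slots = vec![None; capacity * 2];
            // ... re-insert each entry of self.slots into new_slots ...
            std::mem::swap(&mut self.slots, &mut new_slots);
        }
    }
}

fn main() {
    let mut table = HashTable { slots: vec![None; 8], len: 6 };
    table.maybe_resize(); // 6/8 = 0.75, not yet above the threshold
    assert_eq!(table.slots.len(), 8);

    table.len = 7;
    table.maybe_resize(); // 7/8 = 0.875 > 0.75, so we grow
    assert_eq!(table.slots.len(), 16);
}
```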
As we are using Open Addressing with linear probing as our strategy for handling collisions, it may create clusters in our HashTable.
Clustering in a HashTable refers to a situation where multiple keys hash to nearby slots, causing consecutive elements to form a cluster in the table.
Above is a visualization of clustering in a HashTable, created by the amazing Sam Rose.
This makes it harder to find empty slots for new insertions and increases the time taken by insertion operations.
Imagine a hash table where the following elements are inserted —
- Insert key A at index 2.
- Insert key B, which hashes to index 2 (collision), so it's placed at index 3.
- Insert key C, which hashes to index 3 (another collision), so it's placed at index 4.
Now, we have a small "cluster" from index 2 to 4. Every new insertion or search that hashes near this range will likely end up in this cluster, causing more probing and slowing down operations.
To avoid clustering we can use one of the following (a sketch of both probe sequences follows this list):
- Quadratic Probing: In this, if a collision occurs at `index`, then instead of always stepping to the next slot, the probing distance increases quadratically (e.g. 1, 4, 9, etc.), i.e. `index + 1`, `index + 4`, `index + 9`, etc. This reduces the likelihood of clustering.
- Double Hashing: In this we use a second hash function to determine the distance (step size) used to calculate the new index for the element. This makes it harder for values to land close together, and thus less likely to form a cluster.
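Here's a small sketch contrasting the two probe sequences; the capacity and the second hash's step size are made-up values for illustration:

```rust
const CAPACITY: usize = 16;

// Quadratic probing: offsets grow as 1, 4, 9, 16, ...
fn quadratic_probe(index: usize, attempt: usize) -> usize {
    (index + attempt * attempt) % CAPACITY
}

// Double hashing: the step size comes from a second hash function.
// The step must never be zero, or we would not move at all.
fn double_hash_probe(index: usize, step: usize, attempt: usize) -> usize {
    (index + attempt * step) % CAPACITY
}

fn main() {
    // Two keys colliding at index 3 follow very different sequences:
    let quadratic: Vec<usize> = (1..4).map(|i| quadratic_probe(3, i)).collect();
    let double: Vec<usize> = (1..4).map(|i| double_hash_probe(3, 5, i)).collect();

    assert_eq!(quadratic, vec![4, 7, 12]);
    assert_eq!(double, vec![8, 13, 2]);
}
```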
In Rust, overflow occurs when an arithmetic operation produces a value that is too large to fit into the type being used to store it. This happens when the result exceeds the maximum value that the type can represent.
let mut a: u8 = 250;
a = a + 10;
Here the type `u8` can store values in the range `0` to `2^8 - 1`, i.e. 0 to 255, so if we add 10 to 250, the result should be 260. However, since 260 is out of the range of `u8`'s capacity, our operation overflows and results in a panic.
Note
Our program will only panic in debug mode, to alert us of this bug. In release mode our program will automatically wrap the value, resulting in `a`'s value being `4`. How? `(250 + 10) % 256 = 4`.
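Rust also gives us explicit methods to pick the overflow behaviour we want instead of relying on the default. This matters for a hash function like djb2, where overflow is expected and wrapping is the intent; the snippet below is a small demo:

```rust
fn main() {
    let a: u8 = 250;

    // checked_add returns None instead of panicking.
    assert_eq!(a.checked_add(10), None);

    // wrapping_add wraps around modulo 256, like release mode does.
    assert_eq!(a.wrapping_add(10), 4);

    // saturating_add clamps at the type's maximum.
    assert_eq!(a.saturating_add(10), 255);

    // For djb2 in Rust we would opt into wrapping arithmetic on purpose:
    let mut hash: u32 = 5381;
    for &byte in b"ab" {
        hash = hash.wrapping_mul(33).wrapping_add(byte as u32);
    }
    assert_eq!(hash, 5_863_208);
}
```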