- references
- http://openjdk.java.net/jeps/180
- https://mincong-h.github.io/2018/04/08/learning-hashmap/
- https://www.nurkiewicz.com/2014/04/hashmap-performance-improvements-in.html
- https://www.javarticles.com/2012/11/hashmap-faq.html
- https://javaconceptoftheday.com/how-hashset-works-internally-in-java/
- https://www.geeksforgeeks.org/hashing-set-3-open-addressing
TDD - everybody knows that TDD stand for test driven development; however people too often concentrate on the words: test and development, and don't conciders what the world driven really implies. For tests to drive development they must do more than just test that code performs its required functionality: they must clearly express that required functionality to the reader.
Nat Pryce and Steve Freeman, Are your tests really driving your development?
- goals of this workshop:
- practice TDD
- practice asking right questions during development
- become acquainted with implementation of
HashMap
- chance to discuss some bitwise operations
- exemplary solution is in
answers
package
- signed left shift operator
<<
- shifts a bit pattern to the left
- used in counting power of two
- e.g.
1 << 4 = 16
- an unsigned right shift operator
>>>
- shifts a bit pattern to the right regardless sign
- e.g.
A = 60 = 0011 1100
=>A >>> 2 = 0000 1111
and0000 1111 = 15
transient Node<K,V>[] table;
HashMap
useswriteObject
andreadObject
to implement custom serialization rather than just letting its field be serialized normally- it writes the number of buckets and entries to the stream and rebuilds itself from those fields when deserialized
- table itself is simply unnecessary in the serial form
- hash function
static final int hash(Object key) { int h; return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16); }
XOR
- cheapest possible way to reduce systematic lossXOR
has the best bit shuffling properties (on average produces 1/2 one-bits)OR
on average produces 3/4 one-bitsAND
on average produces 3/4 zero-bits
>>>
- in order to use highest bits in calculation of a bucket:tab[(n - 1) & hash]
- as a result, the modulo obtained has less collision
h | h (binary) | h % 32 | (h ^ h >>> 16) % 32 ------------: | :-------------------------------------: | -----: | ------------------: 65,537 | 0000 0000 0000 0001 0000 0000 0000 0001 | 1 | 0 131,073 | 0000 0000 0000 0010 0000 0000 0000 0001 | 1 | 3 262,145 | 0000 0000 0000 0100 0000 0000 0000 0001 | 1 | 5 524,289 | 0000 0000 0000 1000 0000 0000 0000 0001 | 1 | 1
- small number of collisions is virtually inevitable: https://github.com/mtumilowicz/hash-function
- note that instead of using list/tree to resolve hash collisions problem, there is another approach: open addressing
- open addressing basics
- all elements are stored in the hash table itself
- insert(k): find slot for k
- if empty - insert k
- if not empty - search right until an empty slot is found and insert k
- search(k): find slot for k
- if empty - not found
- if not empty and not equal to k - keep searching right
- if not empty and equal to k - found
- delete(k): we cannot perform simple deletion because the search may fail (we introduce gaps)
- marked slots as "deleted"
- we can insert an item in a deleted slot, but the search doesn’t stop at a deleted slot
- open addressing basics
getNode
tab[(n - 1) & hash]
- size have to be power of two:
2^n - 1
, binary representation: all 1 16 - 1 = 15
, binary representation:1111
- bitwise
AND
of any number with1111
-> last 4 bits of the number - it is equivalent to the modulo of the number by
16
- division operation is usually an expensive operation
- bitwise operation is usually preferred over division
- last 4 bits will evaluate to any number from 0 to 15
- suppose we use the size 17 instead
17 - 1 = 16
which is10000
in binary- bitwise AND with
16
-> lose all bits except the 5th bit from the end - regardless of the number, the bucket index will be either 16 or 0
- it means a lot of collisions and poor performance
- instead of
O(1)
for retrieval, you'd needO(log n)
- in case of
ConcurrentHashMap
in a multithreaded environment, you'd experience lot of synchronizations
- resize
- triggered when:
countElementsInBuckets() > countBuckets() * LOAD_FACTOR
- we cannot make decision to resize using concrete bucket size (for example - assuming that we resize when number of elements in some bucket exceeds given threshold), because there will always exist bad hash function that directs all elements into that very bucket
- triggered when:
- JEP 180: Handle Frequent HashMap Collisions with Balanced Trees: improve the performance of
HashMap
under high hash-collision conditions by using balanced trees rather than linked lists to store map entries - when a bucket becomes too big (
TREEIFY_THRESHOLD = 8
),HashMap
dynamically replaces list with a tree (using hash code as a branching variable)- if two hashes are different but ended up in the same bucket, one is considered bigger and goes to the right
- if hashes are equal,
HashMap
compares the keys (if possible)- it is a good practice when keys are
Comparable
- if keys are not comparable, no performance improvements in case of heavy hash collisions
- it is a good practice when keys are
- in some cases (mentioned above) with heavy hash collisions rather than pessimistic
O(n)
we get much betterO(log n)
- when we have proper hash function (or even ideal hash function) the LOAD_FACTOR guarantees that there will be only a tiny amount
of entries in each bucket, and it will be strictly limited for all - therefore we are allowed to say, that complexity of retrieving
element from a hashmap is
O(1)
- initial capacity
- resize (capacity factor) - how to reasonably define load factor?
- assumption: every hash function at least makes a best attempt to distribute hash codes
- growing based on total size keeps the collision lists at a reasonable size with realistic imperfect hash function
- what to do with
null
, any validations?null
keynull
value- note that there is no significant difference between getting key with null value and getting key that not exists
- drop or replace entry with the key that already exists in map?
get
should returnOptional
?- what to do in case of collision? list vs tree vs open addressing
- any threshold?
- thread safe vs not thread safe
- approaches to thread safety (global vs per-bucket synchronization)
- immutable or mutable?
- consequences - (HATM) - https://en.wikipedia.org/wiki/Hash_array_mapped_trie
- how to write tests
- you could unit-test implementation but tests should be easy to throw away
HashSet
usesHashMap
internally to store it’s objects- whenever we create a
HashSet
,HashMap
object associated with it is also created- elements you add into
HashSet
are stored as keys ofHashMap
object, value associated with those keys is a constant
- elements you add into
- from
HashSet
implementationprivate static final Object PRESENT = new Object(); public boolean add(E e) { return map.put(e, PRESENT)==null; } public boolean remove(Object o) { return map.remove(o)==PRESENT; }