\ifx\PREAMBLE\undefined
\input{preamble}
\begin{document}
\fi
\chapter{Data Structures}
Data structures help us organize data so that it can be accessed quickly and usefully. Different data structures support different sets of operations, and are therefore suited to different tasks.
\section{Heap}
A heap, also known as a priority queue, is a container for objects with comparable keys. It should support at least two basic operations: insertion of a new object, and extraction (i.e. removal) of the object with the minimum\footnote{A heap can also support extraction of the object with the maximum key, but extract-min and extract-max cannot be supported simultaneously by the same heap.} key. Both operations are expected to take $O(\log n)$ time. Typical heap implementations also support deletion of an arbitrary object from the heap, which is likewise $O(\log n)$. The construction of a heap from $n$ objects, namely ``heapify'', takes $O(n)$ rather than $O(n\log n)$ time.
\subsection{Use Cases}
A heap can be used for sorting (heap sort). First construct a heap with the $n$ items to be sorted, and then execute extract-min $n$ times. The process takes $O(n\log n)$ time, which is already the optimal running time for comparison-based sorting algorithms.
We've already covered the use of a heap to accelerate Dijkstra's algorithm in the previous chapter.
An interesting use case of heaps is median maintenance. We define the median of a sorted sequence of $n$ items $x_1,\dots,x_n$ to be $x_{\lceil n/2\rceil}$, for example $x_4$ for 8 items and $x_5$ for 9 items.
\begin{description}
\item[Input]A sequence of unsorted items $x_1,x_2,\dots,x_n$ provided one-by-one.
\item[Output]At each step $i$, calculate the median of $x_1,\dots,x_i$ in $O(\log i)$ time.
\end{description}
The problem can be solved using two heaps, as shown in Algorithm \ref{medianmaintenance}. For convenience, we assume that the heaps used here support not only extraction of the min/max, but also checking the key value of the min/max without removing it.
\begin{algorithm}[ht]
\caption{Median Maintenance using Heaps}\label{medianmaintenance}
\begin{algorithmic}[1]
\InputOutput\Statex{see above}
\State{Initialize empty MaxHeap that supports extract-max}\Comment{Stores smaller half}
\State{Initialize empty MinHeap that supports extract-min}\Comment{Stores larger half}
\For{$i$ = 1 \textbf{to} $n$}
\If{MaxHeap.size() $>$ 0 \textbf{and} $x_i<$ MaxHeap.checkMax()}\Comment{Should insert into smaller half}
\State{MaxHeap.insert($x_i$)}
\Else\Comment{insert into larger half}
\State{MinHeap.insert($x_i$)}
\EndIf
\If{MinHeap.size() - MaxHeap.size() == 2}\Comment{If unbalanced, balance the two halves}
\State{MaxHeap.insert(MinHeap.extractMin())}
\ElsIf{MaxHeap.size() - MinHeap.size() == 2}
\State{MinHeap.insert(MaxHeap.extractMax())}
\EndIf
\If{MinHeap.size() $>$ MaxHeap.size()}\Comment{Set median}
\State{median = MinHeap.checkMin()}
\Else\State{median = MaxHeap.checkMax()}
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
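As a concrete illustration, here is a minimal Python sketch of Algorithm \ref{medianmaintenance} using the standard \texttt{heapq} module. Since \texttt{heapq} only provides a min-heap, the max-heap holding the smaller half is simulated by negating keys; the function name \texttt{running\_medians} is our own.
\begin{verbatim}
import heapq

def running_medians(stream):
    """Yield the median after each new item, using two heaps."""
    lower = []   # max-heap via negated keys: smaller half
    upper = []   # min-heap: larger half
    for x in stream:
        if lower and x < -lower[0]:
            heapq.heappush(lower, -x)
        else:
            heapq.heappush(upper, x)
        # Rebalance so the size difference never exceeds 1.
        if len(upper) - len(lower) == 2:
            heapq.heappush(lower, -heapq.heappop(upper))
        elif len(lower) - len(upper) == 2:
            heapq.heappush(upper, -heapq.heappop(lower))
        # The median is the root of the larger heap (lower half on ties).
        yield upper[0] if len(upper) > len(lower) else -lower[0]
\end{verbatim}
For example, \texttt{list(running\_medians([5, 2, 8, 1]))} yields \texttt{[5, 2, 5, 2]}.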
\subsection{Implementation}
A heap can be thought of conceptually as a binary tree that is as complete as possible, i.e. null leaves are only allowed at the lowest level. The key of any node should be smaller than or equal to the keys of its children, if there are any. This guarantees that the object at the root of the tree has the smallest key. The tree can be implemented as an array, with the root at the first position and the nodes of lower levels concatenated level by level afterwards. If the array $A$ is 1-indexed, then the parent of $A[i]$ is $A[\lfloor i/2\rfloor]$, and the children of this node are $A[2i]$ and $A[2i+1]$.
With the array representation of a heap, insertion can be implemented as follows:
\begin{itemize}
\item Put the new object at the end of the array.
\item As long as the key of the new object is smaller than that of its parent, bubble it up.
\end{itemize}
And extract-min can be implemented as follows:
\begin{itemize}
\item Remove the root.
\item Move the last object in the array to the first position.
\item As long as the key of the object at the root is larger than that of at least one of its children, sink it down. If the keys of both children are smaller, the child with the smaller key should be used in the sink-down.
\end{itemize}
The height of the tree is $O(\log n)$, thus either bubble-up or sink-down can be executed at most $O(\log n)$ times, which guarantees that the two operations take $O(\log n)$ running time.
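As a concrete illustration of the array representation, the following Python sketch (0-indexed, so the children of index $i$ are $2i+1$ and $2i+2$ and its parent is $\lfloor(i-1)/2\rfloor$) implements insertion with bubble-up and extract-min with sink-down. It is meant for exposition rather than production use.
\begin{verbatim}
class MinHeap:
    def __init__(self):
        self.a = []                    # array representation of the heap

    def insert(self, key):
        self.a.append(key)             # put the new key at the end ...
        i = len(self.a) - 1
        while i > 0 and self.a[i] < self.a[(i - 1) // 2]:
            parent = (i - 1) // 2      # ... and bubble it up
            self.a[i], self.a[parent] = self.a[parent], self.a[i]
            i = parent

    def extract_min(self):
        root = self.a[0]
        last = self.a.pop()            # move the last key to the root
        if self.a:
            self.a[0] = last
            i, n = 0, len(self.a)
            while True:                # sink it down
                left, right, smallest = 2 * i + 1, 2 * i + 2, i
                if left < n and self.a[left] < self.a[smallest]:
                    smallest = left
                if right < n and self.a[right] < self.a[smallest]:
                    smallest = right
                if smallest == i:
                    break
                self.a[i], self.a[smallest] = self.a[smallest], self.a[i]
                i = smallest
        return root
\end{verbatim}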
\section{Binary Search Tree}
A sorted array supports searching for an element in $O(\log n)$ time, but it takes $O(n)$ time to insert or delete an element. The binary search tree (BST) is a data structure that supports both quick search and quick insertion / deletion.
\subsection{Basic Operations}
Each node of a BST contains a key and three pointers to other nodes: the left child, the right child and the parent. Some of the three pointers can be null. The most important property of a BST is that for any node, all nodes in its left subtree have smaller keys than itself, while all nodes in its right subtree have larger keys. The height of a BST is at least $\log n$ and at most $n$. A \textbf{balanced} BST supports search, insertion and deletion in $O(\log n)$ time, but if it is not balanced, these operations can take as long as $O(n)$ time. Its basic operations are listed below, followed by a short code sketch of search, insertion and traversal.
\begin{description}
\item[search]In order to search for a node with a specific key value $k$:
\begin{itemize}
\item Start from the root node.
\item If a node is null or its key is equal to $k$, return this node.
\item If $k$ is smaller than its key, recursively search its left child.
\item If $k$ is larger than its key, recursively search its right child.
\end{itemize}
\item[insert]In order to insert a new node with key value $k$:
\begin{itemize}
\item Start from the root node.
\item If the node is null, make a new node here with key value $k$.
\item If $k$ is smaller than its key, go to its left child.
\item If $k$ is larger than its key, go to its right child.
\end{itemize}
\item[max]In order to obtain the node with the maximum key value:
\begin{itemize}
\item Start from the root node.
\item As long as the node has a right child, go to its right child.
\item Return the node.
\end{itemize}
\item[min]Similar to max.
\item[successor]In order to obtain the successor of a node with key value $k$:
\begin{itemize}
\item If the node has a right child, return the min of its right subtree.
\item Otherwise repeatedly go to its parent until a node with key larger than $k$ is met, and return that node.
\end{itemize}
\item[predecessor]Similar to successor.
\item[in order traversal]In order to traverse all nodes of a BST in order:
\begin{itemize}
\item Start from the root node.
\item If the node is null, stop.
\item Recursively traverse the left child.
\item Do something to the node, e.g. print its key.
\item Recursively traverse the right child.
\end{itemize}
\item[delete]In order to delete a node with key value $k$:
\begin{itemize}
\item Search for the node.
\item If it has no children, simply remove it (i.e. set its parent's pointer to it to null).
\item If it has 1 child, replace it with its child.
\item If it has 2 children, find its predecessor, which is guaranteed to have at most 1 child, and swap their keys. Then delete the node (now located at its predecessor's old position) as in the previous two cases.
\end{itemize}
\end{description}
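The search, insert and in-order-traversal operations above translate directly into code. The following Python sketch (class and function names are ours, and parent pointers are omitted for brevity) is a minimal illustration.
\begin{verbatim}
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def search(node, k):
    """Return the node with key k, or None if it is absent."""
    if node is None or node.key == k:
        return node
    if k < node.key:
        return search(node.left, k)
    return search(node.right, k)

def insert(node, k):
    """Insert key k into the subtree rooted at node; return its new root."""
    if node is None:
        return Node(k)
    if k < node.key:
        node.left = insert(node.left, k)
    elif k > node.key:
        node.right = insert(node.right, k)
    return node                        # duplicate keys are ignored

def in_order(node, visit):
    """In-order traversal: visits the keys in sorted order."""
    if node is None:
        return
    in_order(node.left, visit)
    visit(node.key)
    in_order(node.right, visit)
\end{verbatim}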
Sometimes a tree node can also store information about the tree itself, for example the size of the subtree rooted at this node. For each node $n$, we have
$$size(n) = size(n.left) + size(n.right) + 1.$$
With this information, we can find the node with the $i^{th}$ smallest key among all nodes (a code sketch follows the list below):
\begin{itemize}
\item Start from the root node.
\item If $size(n.left) = i - 1$, return the node.
\item If $size(n.left) > i - 1$, return the node with the $i^{th}$ smallest key in the left subtree.
\item If $size(n.left) < i - 1$, return the node with the $(i-size(n.left)-1)^{th}$ smallest key in the right subtree.
\end{itemize}
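A sketch of this selection procedure, assuming each node additionally carries a \texttt{size} field (our own naming) that is kept up to date on insertion and deletion:
\begin{verbatim}
def select(node, i):
    """Return the node with the i-th smallest key (1-indexed),
    assuming node.size stores the size of each subtree."""
    left_size = node.left.size if node.left else 0
    if left_size == i - 1:
        return node
    if left_size > i - 1:
        return select(node.left, i)
    return select(node.right, i - left_size - 1)
\end{verbatim}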
\subsection{Red-Black Tree}
The height of a BST can vary between $O(\log n)$ and $O(n)$. Balanced BSTs are guaranteed to have $O(\log n)$ height, thus ensuring the efficiency of operations on them. The red-black tree is one implementation of a balanced BST. In addition to the key and pointers to the parent and children, each node in a red-black tree also stores a bit that indicates whether the node is red or black. A red-black tree satisfies the following conditions:
\begin{enumerate}
\item Each node is either red or black;
\item The root is black;\label{redblackcondition2}
\item There can never be two red nodes in a row, i.e. red nodes must have black parents and children;\label{redblackcondition3}
\item Every root $\rightarrow$ null path has the same number of black nodes.\label{redblackcondition4}
\end{enumerate}
\begin{theorem}
The height of a red-black tree with $n$ nodes is at most $2\log(n+1)$.
\end{theorem}
\begin{proof}
Suppose all root $\rightarrow$ null paths contain $k$ black nodes. Then the red-black tree contains at least $k$ complete levels, because otherwise there would exist a root $\rightarrow$ null path with fewer than $k$ nodes, and thus of course fewer than $k$ black nodes. Therefore we have
$$n\geq 1 + 2 + \dots + 2^{k-1} = 2^k-1,$$
which means $k\leq\log(n+1)$. Suppose the height of the tree is $h$. According to condition \ref{redblackcondition3}, at least half of the nodes on the longest root $\rightarrow$ null path are black, hence
$$h\leq 2k\leq 2\log(n+1).$$
\end{proof}
An important idea in the implementation of red-black trees is rotation, as illustrated in Figure \ref{rotations}. It alters the structure of the tree in a way that makes the tree more balanced, while preserving the BST property.
\begin{figure}[ht]
\begin{subfigure}{\textwidth}
\centering
\begin{tikzpicture}
\tikzstyle{subtree} = [regular polygon, regular polygon sides = 4, draw]
\tikzstyle{singlenode} = [circle,draw]
\node[circle, draw](1) at (0,0){1}
child { node[subtree] {A} }
child { node [singlenode] {2}
child { node[subtree] {B} }
child { node[subtree] {C} }
};
\node[circle, draw](2)[right= 5cm of 1]{2}
child { node [singlenode] {1}
child { node[subtree] {A} }
child { node[subtree] {B} }
}
child { node[subtree] {C} };
\draw[->,very thick] (2.3,-1.5) -- (3.3,-1.5);
\end{tikzpicture}
\caption{left rotation}
\end{subfigure}\\
\begin{subfigure}{\textwidth}
\centering
\begin{tikzpicture}
\tikzstyle{subtree} = [regular polygon, regular polygon sides = 4, draw]
\tikzstyle{singlenode} = [circle,draw]
\node[singlenode](1) at (0,0){1}
child { node [singlenode] {2}
child { node[subtree] {A} }
child { node[subtree] {B} }
}
child { node[subtree] {C} };
\node[singlenode](2)[right= 3cm of 1]{2}
child { node[subtree] {A} }
child { node [singlenode] {1}
child { node[subtree] {B} }
child { node[subtree] {C} }
};
\draw[->,very thick] (1.3,-1.5) -- (2.3,-1.5);
\end{tikzpicture}
\caption{right rotation}
\end{subfigure}
\caption{Rotations in Red Black Tree}\label{rotations}
\end{figure}
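A minimal sketch of the left rotation around node 1 in Figure \ref{rotations}, assuming a \texttt{tree} object that exposes its root and nodes that carry \texttt{left}, \texttt{right} and \texttt{parent} pointers (attribute names are ours); the right rotation is symmetric.
\begin{verbatim}
def rotate_left(tree, x):
    """Left rotation around x: x's right child y takes x's place,
    x becomes y's left child, and y's old left subtree (B) becomes
    x's right subtree.  The BST ordering is preserved."""
    y = x.right
    x.right = y.left                 # subtree B moves under x
    if y.left is not None:
        y.left.parent = x
    y.parent = x.parent              # y takes x's position
    if x.parent is None:
        tree.root = y
    elif x is x.parent.left:
        x.parent.left = y
    else:
        x.parent.right = y
    y.left = x                       # x becomes y's left child
    x.parent = y
\end{verbatim}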
Insertion and deletion in a red-black tree are carried out in two steps. First a normal BST insertion / deletion is executed. Some of the red-black tree conditions may thereby be violated, so we then modify the tree by recoloring nodes and performing rotations in order to restore the conditions.
When we insert a node into the red-black tree as we do for any BST, we first try to color it red. If condition \ref{redblackcondition3} is not violated, then everything is fine. Otherwise we wind up in one of two possible cases, as shown in Figure \ref{redblackinsertion}, in which $x$ is the newly inserted node.
In case 1, all we need to do is recolor the nodes. The red node is propagated upwards, which may induce another violation of condition \ref{redblackcondition3}. The process can repeat $O(\log n)$ times until we reach the root. If the root ends up colored red, condition \ref{redblackcondition2} is violated, and the solution is simply to color it black again.
During the upward propagation, it is possible that we run into case 2. Tackling this case is a bit more involved, but it can be shown that the conditions can be restored with 2--3 rotations and recolorings in $O(1)$ time.
\begin{figure}[H]
\begin{subfigure}{.6\textwidth}
\centering
\begin{tikzpicture}
\tikzstyle{rednode}=[circle,draw,red]
\tikzstyle{blacknode}=[circle,draw]
\node[blacknode](w) at (0,0) {w}
child {node[rednode]{z}}
child {node[rednode]{y}
child{node[rednode,left]{x}}
};
\node[rednode](w1) [right = 3cm of w] {w}
child {node[blacknode]{z}}
child {node[blacknode]{y}
child{node[rednode,left]{x}}
};
\draw[->,very thick] (1.3,-1.5) -- (2.3,-1.5);
\end{tikzpicture}
\caption{case 1}
\end{subfigure}
\begin{subfigure}{.3\textwidth}
\centering
\begin{tikzpicture}
\tikzstyle{rednode}=[circle,draw,red]
\tikzstyle{blacknode}=[circle,draw]
\tikzstyle{subtree} = [regular polygon, regular polygon sides = 4, draw]
\node[blacknode](w2) {w}
child {node[blacknode]{z}}
child {node[rednode]{y}
child{node[rednode]{x}
child{node[subtree]{...}}
child{node[subtree]{...}}
}
child{node[subtree]{...}}
};
\end{tikzpicture}
\caption{case 2}
\end{subfigure}
\caption{Insertion in a Red-Black Tree}\label{redblackinsertion}
\end{figure}
\section{Hash Table}
\subsection{Concepts and Applications}
A hash table is a data structure designed to efficiently maintain a (possibly evolving) set of items, such as financial transactions, IP addresses, or people associated with some data. It supports insertion of a new record, deletion of existing records, and lookup of a particular record (like a dictionary). Assuming that the hash table is properly implemented and that the data is non-pathological, all these operations can be executed in $O(1)$ time: amazingly fast!
Let's first introduce a few typical use cases of hash tables before diving into their implementation.
A hash table can be used to solve the de-duplication problem.
\begin{description}
\item[Input]A stream of objects.
\item[Output]Unique objects in the stream, i.e. the objects with all duplicates removed.
\end{description}
The problem arises when we want to count the number of unique visitors to a website, or when we want to remove duplicates from the results of a search. With a hash table over the objects, we can solve the problem in linear time: examine the objects one by one, and for each object $x$ do a lookup in the hash table $H$. If $x$ is not found in $H$, insert it into $H$ and append it to the result; otherwise just continue with the next object.
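A one-pass sketch in Python, where the built-in \texttt{set} plays the role of the hash table $H$:
\begin{verbatim}
def deduplicate(stream):
    """Return the unique objects of the stream in first-seen order."""
    seen = set()            # the hash table H
    result = []
    for x in stream:
        if x not in seen:   # O(1) expected lookup
            seen.add(x)     # O(1) expected insertion
            result.append(x)
    return result
\end{verbatim}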
Another application is the 2-sum problem.
\begin{description}
\item[Input]An unsorted array $A$ of $n$ integers, and a target sum $t$.
\item[Output]Whether there exist two numbers $x,y\in A$ such that $x+y=t$.
\end{description}
A naive enumerative solution takes $O(n^2)$ time. If we sort $A$ and then binary-search for $t-x$ for every $x\in A$, the running time can be reduced to $O(n\log n)$. But with a hash table, the problem can be solved in merely $O(n)$ time: insert all items into the hash table, and then for each $x\in A$ check whether $t-x$ is in the hash table via a lookup.
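A linear-time sketch in Python (function name ours); this variant inserts and checks in a single pass, and requires $x$ and $y$ to be two distinct entries of $A$, which also handles the case $2x=t$ without extra bookkeeping.
\begin{verbatim}
def two_sum_exists(A, t):
    """Return True iff two distinct entries x, y of A satisfy x + y == t."""
    seen = set()                 # hash table of items seen so far
    for x in A:
        if t - x in seen:        # O(1) expected lookup
            return True
        seen.add(x)              # O(1) expected insertion
    return False
\end{verbatim}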
In the early days of compilers, hash tables were used to implement symbol tables. The administrator of a network can use a hash table to block certain IP addresses. When exploring the huge game trees of chess or Go, a hash table can be used to avoid exploring the same configuration repeatedly, since a configuration can appear an enormous number of times in the tree. In the last case, the tree is so large that a hash table is the only plausible way to record whether a configuration has been explored.
\subsection{Implementation}
When implementing a hash table, we should think of a typically very large universe $U$ (e.g. all IP addresses, all names, all chessboard configurations, etc.), of which we wish to maintain an evolving subset $S$ of reasonable size. The general approach is as follows.
\begin{enumerate}
\item Pick $n$ as the number of ``buckets''. $n$ should be comparable with the size of $S$.
\item Choose a hash function $h:U\rightarrow\{0,1,2,\dots,n-1\}$.
\item Use array $A$ of length $n$ to store the items. $x$ should be stored in $A[h(x)]$.
\end{enumerate}
\begin{definition}
For a specific hash function $h$ on a universe $U$, we say there is a collision if $\exists$ distinct $x,y\in U$ such that $h(x)=h(y)$.
\end{definition}
Think of the famous ``birthday paradox'': how many people are needed so that the probability that at least 2 of them share a birthday exceeds 50\%? The answer is 23, which is quite a small number. This example demonstrates that collisions are far from unlikely, and thus a good hash table implementation must be able to resolve collisions properly. There are two popular solutions:
\begin{description}
\item[Chaining]Each bucket keeps a linked list containing the items with the corresponding hash value. Given an object $x$, an insertion / deletion / lookup on the hash table is executed in the list $A[h(x)]$.
\item[Open addressing]A bucket stores only one object. The hash function specifies a probe sequence $h_1(x),h_2(x),\dots$ When an object is inserted into the hash table, the sequence is followed until an empty slot is found. The sequence can be linear (i.e. slots are probed consecutively, as in linear probing), or determined by two independent hash functions (double hashing).
\end{description}
For a hash table with chaining, insertions are always $\Theta(1)$ because we simply insert the new element at the front of a list, while deletions and lookups take $\Theta(\text{list length})$ time. With $m$ objects in $n$ buckets, the maximal length of a list can be anywhere from $m/n$ (all lists have equal length) to $m$ (all objects land in the same bucket). The situation with open addressing is similar. Obviously, the performance of an implementation depends heavily on the choice of the hash function. A good hash function should spread the data out evenly among the buckets, and should be easy to evaluate and store.
A widely used method to define a hash function consists of two steps. First an object is transformed into a (usually large) integer, namely the hash code, and then the integer is mapped by a compression function to a number between $0$ and $n-1$, i.e. the index of a bucket. The ``$\mathrm{mod}\ n$'' function can serve as the compression function.
The number of buckets $n$ must be selected with caution. It should be a prime within a constant factor of the number of objects expected to be stored in the table, and it should not be close to a power of 2 or 10.
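A minimal sketch of a chaining hash table in Python, with a fixed prime number of buckets and Python's built-in \texttt{hash} as the hash-code step (a real implementation would also resize the table as the load factor grows; Python lists stand in for the linked lists):
\begin{verbatim}
class ChainedHashTable:
    def __init__(self, n_buckets=997):            # a prime number of buckets
        self.n = n_buckets
        self.buckets = [[] for _ in range(self.n)]

    def _bucket(self, key):
        return self.buckets[hash(key) % self.n]   # hash code + compression

    def insert(self, key):
        bucket = self._bucket(key)
        if key not in bucket:                     # avoid duplicates
            bucket.append(key)

    def lookup(self, key):
        return key in self._bucket(key)           # Theta(list length)

    def delete(self, key):
        bucket = self._bucket(key)
        if key in bucket:
            bucket.remove(key)
\end{verbatim}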
\begin{definition}
The load factor $\alpha$ of a hash table is defined as
$$\alpha=\frac{\text{\# of objects in the hash table}}{\text{\# of buckets in the hash table}}.$$
\end{definition}
Obviously, for open addressing $\alpha$ has to be smaller than 1, whereas chaining can cope with $\alpha\geq 1$.
In general, $\alpha$ has to be $O(1)$ to guarantee constant running time for hash table operations. In particular, $\alpha\ll 1$ is expected for open addressing.
\subsection{Universal Hashing}
We wish to design a clever hash function that spreads every data set quasi-evenly among the buckets. Unfortunately no such function exists, because every hash function has a pathological data set. The reason is that for any hash function $h:U\rightarrow\{0,1,\dots,n-1\}$, by the pigeonhole principle there exists a bucket $i$ such that at least $\lvert U\rvert/n$ elements of $U$ hash to $i$ under $h$. If the data set is a subset of these elements, all of them collide. This can become dangerous in real-world systems: a simple hash function can be reverse engineered and abused.
There are two solutions to this problem. Either a cryptographic hash function, e.g. SHA-2, should be used to make the reverse engineering infeasible, or a randomized approach should be taken: we should design a family $H$ of hash functions such that for any data set $S$, a randomly chosen function $h\in H$ is almost guaranteed to spread $S$ out quasi-evenly. Such a family of hash functions is called universal.
\begin{definition}
Let $H$ be a set of hash functions $h:U\rightarrow\{0,1,\dots,n-1\}$. $H$ is universal if and only if $\forall x,y\in U(x\neq y),$
$$P(h(x)=h(y))\leq \frac{1}{n},$$
in which $n$ is the number of buckets and $h$ is a hash function chosen uniformly at random from $H$. Note that $1/n$ is exactly the collision probability for purely random hashing.
\end{definition}
We will now provide a universal family of hash functions for IP addresses. Let $U$ be the universe of all IP addresses of the form $(x_1,x_2,x_3,x_4)$, in which each $x_i$ is an integer between 0 and 255 inclusive. Let $n$ be a prime comparable with the number of objects in the hash table and larger than 255. We define a hash function $h_a$ for each 4-tuple $a=(a_1,a_2,a_3,a_4)$ with each $a_i\in\{0,1,\dots,n-1\}:$
$$h_a(x_1,x_2,x_3,x_4)=\left(\sum\limits_{i=1}^4a_ix_i\right)\mod n.$$
Then the family of all $h_a$ is universal.
\begin{proof}
Consider two distinct IP addresses $x=(x_1,x_2,x_3,x_4)$, $y=(y_1,y_2,y_3,y_4)$, and assume that $x_4\neq y_4$. If $x$ and $y$ collide, we have
\begin{align*}
\left(\sum\limits_{i=1}^4a_ix_i\right)\mod n&=\left(\sum\limits_{i=1}^4a_iy_i\right)\mod n\\
a_4(x_4-y_4)\mod n&=\left(\sum\limits_{i=1}^3a_i(y_i-x_i)\right)\mod n
\end{align*}
For an arbitrary fixed choice of $a_1,a_2,a_3$, the rhs is a fixed number between 0 and $n-1$ inclusive. With $(x_4-y_4)\mod n\neq 0$ (guaranteed by $n>255$ and $x_4\neq y_4$) and $a_4$ chosen uniformly at random from $\{0,1,\dots,n-1\}$, the lhs is equally likely to be any element of $\{0,1,\dots,n-1\}$, because $n$ is prime. Therefore the probability of collision is $\frac{1}{n}$.
\end{proof}
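A sketch of this hash family in Python: \texttt{n\_buckets} should be a prime larger than 255, and the coefficient tuple $a$ is drawn at random once, when the hash function is created (names are ours).
\begin{verbatim}
import random

def make_ip_hash(n_buckets):
    """Draw a random member h_a of the universal family for IP addresses."""
    a = [random.randrange(n_buckets) for _ in range(4)]    # a_1, ..., a_4
    def h(ip):                     # ip is a 4-tuple (x1, x2, x3, x4)
        return sum(ai * xi for ai, xi in zip(a, ip)) % n_buckets
    return h

# Example usage: 10007 is a prime larger than 255.
h = make_ip_hash(10007)
bucket = h((192, 168, 0, 1))
\end{verbatim}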
Now we would like to verify the $O(1)$ running time guarantee of a hash table implemented with chaining, with the hash function $h$ selected uniformly at random from a universal family $H$. Here we assume that $\lvert S\rvert=O(n)$, i.e. $\alpha=\frac{\lvert S\rvert}{n}=O(1)$, and that the hash function can be evaluated in $O(1)$ time.
\begin{proof}
As discussed before, the running time of basic operations on a hash table implemented with chaining is $O(list\:length)$. So here we will try to prove that the expectation of the list length $L$ is $O(1)$.
For a specific list corresponding to hash value $h(x)$, we define
\begin{equation*}
Z_y=\begin{cases}
1\text{ if }h(y)=h(x)\\
0\text{ otherwise}\\
\end{cases}
\end{equation*}
for any $y\in S$. Then obviously $L=\sum\limits_{y\in S}Z_y$. Thus we have
\begin{align*}
E[L]&= \sum\limits_{y\in S}E[Z_y]=\sum\limits_{y\in S}P(h(y)=h(x))\\
&\leq\sum\limits_{y\in S}\frac{1}{n}=\frac{\lvert S\rvert}{n}=O(1).
\end{align*}
\end{proof}
The running time of operations on a hash table implemented with open addressing is harder to analyze. We will use the heuristic assumption that all $n!$ probe sequences are equally likely, which is not true in practice but facilitates an idealized quick analysis. Under this heuristic assumption, the expected running time of an insertion is $\frac{1}{1-\alpha}$.
\begin{proof}
A random probe finds an empty slot with probability $1-\alpha$. A random probe sequence can be regarded as repetitions of random probes\footnote{Actually the probability for probes after the first probe to find an empty slot is larger, because we don't examine the same slot twice. The running time $\frac{1}{1-\alpha}$ is an upper bound.}. Thus the expectation of the number of probes needed for finding an empty slot is $\frac{1}{1-\alpha}$.
\end{proof}
For linear probing, the heuristic assumption is badly violated, so we assume instead that the initial probe is uniformly random, which is again not quite true in practice. Knuth proved in 1962 that under this assumption the expected running time of an insertion is $\frac{1}{(1-\alpha)^2}$.
\section{Bloom Filters}
The bloom filter is, besides the hash table, another data structure that supports fast insertions and lookups. It uses much less space than a hash table, at the price of the following shortcomings:
\begin{itemize}
\item It cannot store associated objects;
\item It does not support deletions;
\item There is a small probability of a false positive in the lookup result.
\end{itemize}
Historically, bloom filters were used to implement spell-checkers. A canonical use case is to forbid passwords of certain patterns. They are also used in network routers for tasks like banning certain IP addresses. They are desirable in such environments because memory is limited, lookups have to be super fast, and occasional false positives are tolerable.
The basic ingredients of a bloom filter are an array $A$ of $n$ bits and $k$ hash functions $h_i,\:i=1,2,\dots,k$. To insert element $x$, we set $A[h_i(x)] = 1$ for all $i$. To look up $x$, we return true if $A[h_i(x)] = 1$ for all $i$. Clearly, if all bits associated with element $x$ via the $k$ hash functions have already been set to 1 even though $x$ itself was never inserted, a lookup for $x$ yields a false positive. If the probability of a false positive were too large, the bloom filter would be useless. We will use a heuristic analysis to show that this probability is very small in practice.
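A minimal bloom filter sketch in Python. For illustration, the $k$ hash functions are derived from Python's built-in \texttt{hash} applied to $(i,x)$ pairs, which merely stands in for $k$ properly independent hash functions.
\begin{verbatim}
class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = [False] * n_bits

    def _positions(self, x):
        # k pseudo-independent hash values for x (illustrative choice).
        return [hash((i, x)) % self.n for i in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):
            self.bits[p] = True

    def lookup(self, x):
        # May return True for an x never inserted (false positive),
        # but never returns False for an x that was inserted.
        return all(self.bits[p] for p in self._positions(x))
\end{verbatim}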
We assume that across different $i$ and $x$, all $h_i(x)$ are uniformly random and independent. The assumption is generally not true, but it helps to understand the trade-off between space and error in bloom filters.
We would like to insert a data set $S$ into a bloom filter using $n$ bits. The probability that a certain bit has been set to 1 after inserting $S$ is
\begin{equation}\label{bloomfilterheuristic}
1- \left(1-\frac{1}{n}\right)^{k\lvert S\rvert}\approx 1-e^{-k\lvert S\rvert/n}=1-e^{-k/b},
\end{equation}
in which $b=\frac{n}{\lvert S\rvert}$ is the number of bits per object. Note that the approximation is only accurate when $n$ is large, i.e. when $1/n$ is small. For an element not in $S$, the probability of a false positive is
\begin{equation*}
\epsilon=\left(1-e^{-k/b}\right)^k.
\end{equation*}
Let $t=k/b$, then we have
\begin{align*}
\ln\epsilon&=bt\ln(1-e^{-t})\\
\dv{\ln\epsilon}{t}&=\frac{b}{e^t-1}\left(t+(e^t-1)\ln(1-e^{-t})\right).
\end{align*}
When $t=\ln 2$, $\dv{\ln\epsilon}{t}=0$ and $\epsilon$ is minimized. Thus we should use $k=(\ln 2)b\approx 0.693b.$ With $b=8$, we should choose $k=5$ or 6, and $\epsilon$ will be approximately 2\%.
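A quick numerical check of this choice of parameters (the function name is ours):
\begin{verbatim}
import math

def false_positive_rate(b, k):
    """Heuristic false-positive probability (1 - e^(-k/b))^k."""
    return (1 - math.exp(-k / b)) ** k

# With b = 8 bits per object, k = 5 or 6 gives roughly a 2% error rate.
print(false_positive_rate(8, 5))   # ~0.0217
print(false_positive_rate(8, 6))   # ~0.0216
\end{verbatim}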
\section[Union Find]{Union Find\protect\footnote{This topic was originally covered as an optional topic in Part 2. I put it here because it's a pure data structure topic that fits this chapter better.}}\label{unionfind}
The union-find data structure maintains a partition of a set of objects into groups. It supports two essential operations:
\begin{description}
\item[$find(x)$]Returns the name of the group to which $x$ belongs.
\item[$union(C_i, C_j)$]Merges groups $C_i,C_j$ into one group.
\end{description}
\subsection{Quick Find UF}
\begin{multicols}{2}
\begin{algorithmic}[1]
\Function{find}{x}
\State{return leader(x)}
\EndFunction
\end{algorithmic}
\columnbreak
\begin{algorithmic}[1]
\Function{union}{x,y}
\If{size(x) $>$ size(y)}
\For{i in y's group}
\State{leader(i) = x}
\EndFor
\State{size(x) += size(y)}
\Else
\For{i in x's group}
\State{leader(i) = y}
\EndFor
\State{size(y) += size(x)}
\EndIf
\EndFunction
\end{algorithmic}
\end{multicols}
In this implementation, a leader is chosen from each group arbitrarily. Each group is represented by its leader. Each object maintains a pointer to its leader. When two groups get merged, the leader of the larger group (i.e. the group that contains more objects) becomes the leader of the merged group.
$find()$ is easy in this implementation: just return the object's leader, which takes $O(1)$ time. However, $union()$ takes $O(n)$ time in the worst case, because every object in the smaller group has to have its leader pointer updated. Nonetheless, if we consecutively merge groups so that in the end all objects are in the same group, each object has its leader updated at most $\log n$ times, because every time an object's leader is updated, the size of the group to which it belongs at least doubles. Therefore in total there can be at most $O(n\log n)$ leader updates.
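A sketch of quick find in Python, with the leader pointers stored in an array and the members of each group kept in a dictionary keyed by leader (names are ours):
\begin{verbatim}
class QuickFindUF:
    def __init__(self, n):
        self.leader = list(range(n))               # leader[i] = leader of i
        self.members = {i: [i] for i in range(n)}  # leader -> group members

    def find(self, x):
        return self.leader[x]                      # O(1)

    def union(self, x, y):
        lx, ly = self.find(x), self.find(y)
        if lx == ly:
            return
        # Relabel the smaller group, so at most half the objects change leader.
        if len(self.members[lx]) < len(self.members[ly]):
            lx, ly = ly, lx
        for i in self.members[ly]:
            self.leader[i] = lx
        self.members[lx].extend(self.members.pop(ly))
\end{verbatim}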
\subsection{Quick Union UF}
\begin{multicols}{2}
\begin{algorithmic}[1]
\Function{find}{x}
\While{x $\neq$ parent(x)}
\State{x = parent(x)}
\EndWhile
\State{return x}
\EndFunction
\end{algorithmic}
\columnbreak
\begin{algorithmic}[1]
\Function{union}{x,y}
\State{lx = find(x), ly = find(y)}
\If{size(lx) $>$ size(ly)}
\State{parent(ly) = lx}
\State{size(lx) += size(ly)}
\Else
\State{parent(lx) = ly}
\State{size(ly) += size(lx)}
\EndIf
\EndFunction
\end{algorithmic}
\end{multicols}
In this implementation\footnote{It is different from the lazy union implementation in the lectures. It is as efficient as union by rank, but much easier to verify.}, each object maintains a pointer to its parent instead of its leader. Only the leader has itself as its parent. In this way each group forms a tree, with the leader as root. $find()$ follows parent pointers until the leader is met. $union()$ makes the leader of the smaller group a child of the leader of the larger group.
It can be proved by induction that a group with $k$ objects forms a tree of height no more than $\log k$.
\begin{proof}
The base case is trivial: each group has 1 object and height 0. Suppose $1\leq i\leq j$. When a group with $i$ objects (and height $h_i$) is merged into a group with $j$ objects (and height $h_j$), the root of the smaller group becomes a child of the root of the larger group, so the height of the new group is
$$h_{i+j}=\max(h_j,h_i+1)\leq\max(\log j,\log i + 1)=\max(\log j,\log(2i))\leq\log(i+j).$$
\end{proof}
Since the tree has at most logarithmic height, the running times of $find()$ and $union()$ are both $O(\log n).$
\subsection{Union by Rank}
\begin{multicols}{2}
\begin{algorithmic}[1]
\Function{find}{x}
\While{x $\neq$ parent(x)}
\State{x = parent(x)}
\EndWhile
\State{return x}
\EndFunction
\end{algorithmic}
\columnbreak
\begin{algorithmic}[1]
\Function{union}{x,y}
\State{lx = find(x), ly = find(y)}
\If{rank(lx) $>$ rank(ly)}
\State{parent(ly) = lx}
\ElsIf{rank(lx) $<$ rank(ly)}
\State{parent(lx) = ly}
\Else\Comment{equal}
\State{parent(lx) = ly}
\State{rank(ly) += 1}
\EndIf
\EndFunction
\end{algorithmic}
\end{multicols}
$find()$ is the same as in quick union. Each object maintains a rank field, which is initialized to 0 and increases only when two trees whose roots have the same rank are merged. In $union()$, rank instead of size is used to decide which root serves as the root of the merged tree. It can be verified that union by rank also achieves $O(\log n)$ running time.
It follows immediately from the implementation of $union()$ that
\begin{enumerate}
\item For any object x, rank(x) can only increase over time.
\item Only ranks of roots can go up.
\item Along a path to the root, ranks strictly increase.
\end{enumerate}
\begin{lemma}\textbf{(Rank Lemma)}
After an arbitrary number of union operations, there are at most $n/2^r$ objects with rank $r$, in which $r\geq 0$.
\end{lemma}
\begin{proof}
First, if $x,y$ are distinct objects with the same rank $r$, then their subtrees must be disjoint. If they had a common node $z$, the paths $z\rightarrow x$ and $z\rightarrow y$ would have to coincide on their common part, because each node has only one parent. Then one of $x$ and $y$ would have to be an ancestor of the other, which is impossible because ranks strictly increase along a path to the root, yet $x$ and $y$ have the same rank.
Then it can be verified by induction that a rank-$r$ object has a subtree of size $\geq 2^r$. In the base case, every object is a root with rank 0 and subtree size 1. When two trees whose roots have different ranks merge, no rank changes while subtree sizes only grow, so the claim cannot be violated. When two trees $t_1,t_2$ whose roots have the same rank $r$ merge, the rank of the new root is $r+1$, and it is the only node whose rank changes. Since $t_1,t_2$ both have size $\geq 2^r$, the new tree has size $\geq 2^{r+1}$, so the claim is preserved.
The rank lemma follows directly from the two claims above.
\end{proof}
According to the rank lemma, no object can have a rank larger than $\log n$ (there is at most one object with rank $\log n$, which can only be the root). Since ranks strictly increase along any path to the root, the tree has height at most $\log n$, and the running time of $find()$ and $union()$ is $O(\log n)$.
\subsection{Path Compression}
If the $find()$ operation is expected to be executed multiple times for each object, which is almost always the case, it makes no sense to repeat the same traversal every time. Instead, we can make the parent pointers of all objects met during the traversal point directly to the root, i.e. the leader, so that later $find()$ operations on these objects take $O(1)$ time. This modification adds only a constant-factor overhead to the first $find()$ on objects that are not direct children of the leader, and greatly speeds up subsequent $find()$ operations.
\begin{algorithmic}[1]
\Function{find}{x}
\State{leader = x}
\While{leader $\neq$ parent(leader)}
\State{leader = parent(leader)}
\EndWhile
\While{parent(x) $\neq$ leader}
\State{t = parent(x)}
\State{parent(x) = leader}
\State{x = t}
\EndWhile
\State{return leader}
\EndFunction
\end{algorithmic}
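A Python sketch combining union by rank with the path-compressing $find()$ above (names are ours):
\begin{verbatim}
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))     # every object starts as its own root
        self.rank = [0] * n

    def find(self, x):
        root = x
        while root != self.parent[root]:           # locate the leader
            root = self.parent[root]
        while self.parent[x] != root:              # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        lx, ly = self.find(x), self.find(y)
        if lx == ly:
            return
        if self.rank[lx] > self.rank[ly]:          # union by rank
            self.parent[ly] = lx
        elif self.rank[lx] < self.rank[ly]:
            self.parent[lx] = ly
        else:                                      # equal ranks
            self.parent[lx] = ly
            self.rank[ly] += 1
\end{verbatim}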
We will now analyze precisely how much path compression improves union by rank. Ranks are maintained exactly as they are without path compression. As a consequence, rank(x) is now only an upper bound on the maximum number of steps along a path from a leaf to x. But the rank lemma still holds, and we still have rank(parent(x)) $>$ rank(x) for all non-root x.
\subsubsection{Hopcroft-Ullman's Analysis}
\begin{theorem} \textbf{(Hopcroft-Ullman Theorem)}
With union by rank and path compression, $m$ union + find operations take $O(m\log^*n)$ time, where
\begin{equation*}
\log^*n=\begin{cases}
0&\text{if }n\leq 1\\
1+\log^*(\log n)&\text{if }n>1
\end{cases}
\end{equation*}
\end{theorem}
We will focus on the case when $m=\Omega(n).$
\begin{proof}
First we divide the interval $[0,n]$ into rank blocks: $\{0\}$, $\{1\}$, $\{2,3,4\}$, $\{5,\dots,2^4\}$, $\{17,\dots,2^{16}\}$, $\{65537,\dots,2^{65536}\}$, $\dots$, $\{\dots,n\}$. In general, there are $O(\log^*n)$ rank blocks.
Consider a non-root object x; since it is not a root, rank(x) is fixed from now on. At a given point in time, we call an object x good if one of the following two conditions is satisfied:
\begin{itemize}
\item x or parent(x) is a root;
\item rank(parent(x)) is in a larger block than rank(x).
\end{itemize}
If neither condition is satisfied, we say x is bad.
A $find()$ operation can visit at most $O(\log^*n)$ good objects (the root, a direct child of the root, and at most 1 per rank block). Over $m$ operations, these visits take $O(m\log^*n)$ time.
To compute the time consumption of visits to bad nodes, consider a rank block \{k+1, $\dots, 2^k$\}. Note that each time a bad node is visited, its parent is changed to another node (its then root) with strictly larger rank than its current parent. For a bad node x with rank(x) in this block, this process can happen at most $2^k$ times before x becomes good. Therefore the number of visits to x while x is bad and rank(x) is in this block is $\leq 2^k$. According to the rank lemma, the number of objects x with rank in this block is $\leq\sum\limits_{i=k+1}^{2^k}\frac{n}{2^i}<\frac{n}{2^k}$. Thus the total number of visits to bad objects in this block is $\leq 2^k\cdot\frac{n}{2^k}=n$. Since there are $O(\log^* n)$ blocks, the total time spent on visiting bad nodes is $O(n\log^*n)$.
In conclusion, the running time of $m$ operations is $O((m+n)\log^*n)$. Since we are interested in the case when $m=\Omega(n)$, it is equivalent to $O(m\log^*n)$.
\end{proof}
\subsubsection{Tarjan's Analysis}
Hopcroft-Ullman theorem already provides an upper bound quite close to linear running time. Yet Tarjan proved that there is an even better upper bound.
For integers $r\geq 1, k\geq 0$, the Ackermann function $A_k(r)$ is defined as
\begin{align*}
A_0(r)&=r+1\\
A_k(r)&=\underbrace{(A_{k-1}\circ A_{k-1}\circ\dots\circ A_{k-1})}_{r\text{ times}}(r),\:k\geq 1
\end{align*}
It's easy to derive that $A_1(r)=2r$ and $A_2(r)=r2^r$. Because $A_2(r)\geq 2^r$, $A_3(r)$ is at least the result of applying the map $r\mapsto 2^r$ $r$ times to $r$, i.e. an ``exponential tower'' of height $r$:
$$A_3(r)\geq\underbrace{2^{2^{\cdot^{\cdot^{2}}}}}_{r\text{ times}}.$$
Specifically, $A_1(2)=4$, $A_2(2)=8$, $A_3(2)=A_2(A_2(2))=A_2(8)=8\times 2^8=2048$. $A_4(2)=A_3(2048)$, which is larger than the exponential tower of height 2048.
For integer $n\geq 4$, we define the inverse Ackermann function
$$\alpha(n)=\text{ minimum value of $k$ such that }A_k(2)\geq n.$$
Since $A_k(2)$ blows up fast as $k$ gets larger, $\alpha(n)$ grows extremely slow. As a comparison:
\begin{equation*}
\alpha(n)=\begin{cases}
1&n=4\\
2&n=5,\dots,8\\
3&n=9,\dots,2048\\
4&n=2049,\dots,A_3(2048)\\
&\dots\\
\end{cases}
\qquad
\log^*n=\begin{cases}
1&n=2\\
2&n=3,4\\
3&n=5,\dots,16\\
4&n=17,\dots,65536\\
5&n=65537,\dots,2^{65536}\\
\end{cases}
\end{equation*}
For $n=\underbrace{2^{2^{\cdot^{\cdot^{2}}}}}_{2048\text{ times}}$ (an exponential tower of height 2048), $\log^*n=2048$ while $\alpha(n)=4$.
\begin{theorem} \textbf{(Tarjan's Theorem)}
With union by rank and path compression, $m$ union + find operations take $O(m\alpha(n))$ time.
\end{theorem}
In order to prove Hopcroft-Ullman's theorem, we used the fact that if parent(x) is updated from p to p', then rank(p') $\geq$ rank(p) + 1. To verify Tarjan's theorem, we will use a stronger version of this claim: in most cases rank(p') is much bigger than rank(p) (not just by 1).
\begin{proof}
Consider a non-root object x, thus rank(x) is fixed. Define
\begin{center}
$\delta(x)$ = max $k$ such that rank(parent(x)) $\geq$ $A_k($rank(x)).
\end{center}
As a few examples:\\
\begin{equation*}
\begin{cases}
\delta(x)\geq 0\iff \text{rank(parent(x))} \geq \text{rank(x) + 1 (always)}\\
\delta(x)\geq 1\iff \text{rank(parent(x))} \geq 2\cdot\text{rank(x)}\\
\delta(x)\geq 2\iff \text{rank(parent(x))} \geq \text{rank(x)}\cdot 2^{\text{rank(x)}}\\
\end{cases}
\end{equation*}
Note that for every object x with rank(x)$\geq 2$, we must have $\delta(x)\leq\alpha(n)$, because
$$A_{\alpha(n)}(\text{rank(x)})\geq A_{\alpha(n)}(2)\geq n\geq\text{rank(parent(x))}.$$
An object is defined as bad if \textbf{all} of the following conditions hold:
\begin{enumerate}
\item x is not a root;
\item parent(x) is not a root;
\item rank(x)$\geq$2;
\item x has an ancestor y with $\delta(y)=\delta(x)$.
\end{enumerate}
Otherwise x is good. Along an object $\rightarrow$ root path, the number of good objects is $O(\alpha(n))$: 1 root, 1 direct child of the root, 1 object with rank 0, 1 object with rank 1, and at most 1 object x with $\delta(x)=k$ for each $k=0,1,\dots,\alpha(n)$ (a second object on the path with the same $\delta$ value would have the first as an ancestor and hence be bad). Thus the total number of visits to good objects over $m$ operations is $O(m\alpha(n))$.
Consider the visit to a bad object x. x has an ancestor y with $\delta(x)=\delta(y)=k$. Suppose x's parent is p, y's parent is p', then we have
\begin{center}
rank(x's new parent) $\geq$ rank(p') $\geq A_k($rank(y)) $\geq A_k($rank(p))
\end{center}
The first and third $\geq$ are because rank only goes up from child to parent. The second $\geq$ comes from the definition of $\delta(y)$. This relation indicates that path compression at least applies the $A_k$ function to rank(x's parent). If r = rank(x), then after r such pointer updates, we have
\begin{center}
rank(parent(x)) $\geq\underbrace{(A_k\circ\dots\circ A_k)}_{\text{r times}}(r)=A_{k+1}(r).$
\end{center}
Hence, every r visits to x while x is bad increases $\delta(x)$. Because $\delta(x)\leq\alpha(n)$, there can be at most $r\alpha(n)$ visits to x while it's bad. Thus the total number of visits to bad objects is
\begin{equation*}
N(\text{bad})\leq\sum\limits_{\text{x is bad}}\text{rank(x)}\cdot\alpha(n)\leq\alpha(n)\sum\limits_{r\geq 2}r\cdot\frac{n}{2^r}<2n\alpha(n)
\end{equation*}
In conclusion, the total running time is $O(m\alpha(n))+O(n\alpha(n))=O(m\alpha(n))$, since we again focus on the case $m=\Omega(n)$.
\end{proof}
\ifx\PREAMBLE\undefined
\end{document}
\fi