Implement k-cluster with outliers streaming algorithm.
It contains the implementation of two algorithms from these papers:
-
Section 3 from Algorithms for Facility Location Problems with Outliers, Moses Charikar, Samir Khuller, David M. Mount, Giri Narasimhan acm.
-
Algorithm 3.1 from Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity, Richard Matthew McCutchen, Samir Khuller springer. Where the requirement to an offline 4-approximation algorithm has been replaced with the above algorithm.
The core code only requires a C++17/OpenMP capable compilator and the library jsoncpp.
Building maps and graphs requires python3 together with libraries glob, pillow and matplotlib.
https://github.com/remi-dupre/cluster-outliers.git
cd cluster-outliers
make release
# Use every logical threads as a default, but it seems to behave better without
# any hyperthreading going on.
export OMP_NUM_THREADS=4
# Cover the graph saved in dataset.txt using 50 clusters and 10 outliers.
./clustering dataset.txt 50 10
The input file must contain one vertex per line, beeing described with three columns separated with spaces. The first columns is ignored, and the second and last columns contains longitude and latitude of the point.
# Build 2D-maps in the directory maps/ from existing logs
scripts/map.py
# Plot a benchmark about existing logs
scripts/plot.py
All tests were run on this dataset containing the geolocalisation of 1,000,000 tweets around the world.
Here are measures for several parameters:
- clusters: the number of allowed clusters in the solution
- outliers: number of allowed outliers
- alpha: disks growth parameter in the algorithm
- cost: max radius of the solution
- min opt: a lower bound to the optimal solution
- max opt: an upper bound to the optimal solution
- time: run time of the algorithm
clusters | outliers | α | cost | min opt | max ratio | time |
---|---|---|---|---|---|---|
20 | 10 | 1.1 | 0.8282 | 0.3125 | 2.6502 | 2.6360s |
20 | 50 | 1.1 | 0.6900 | 0.1992 | 3.4637 | 18.3627s |
20 | 10 | 4 | 1.4160 | 0.3125 | 4.5312 | 1.9803s |
20 | 20 | 4 | 0.8843 | 0.2666 | 3.3169 | 2.4288s |
20 | 30 | 4 | 0.8665 | 0.2373 | 3.6514 | 2.4714s |
20 | 40 | 4 | 1.3749 | 0.2178 | 6.3136 | 2.6495s |
20 | 50 | 4 | 1.2991 | 0.1992 | 6.5210 | 3.1103s |
For parameters clusters = 20, outliers = 50 and α = 1.1 the output looks like:
The bound on the optimal is computed by finding (clusters + outliers + 1) disks that don't intersect, this is how it looks like using previous parameters: