This algorithm is an improvement to the k-means algorithm by using vectorization methods.
- Decide on a value for k.
- Initialize the k cluster centers.
- Decide the class memberships of the N objects by assigning them to the nearest cluster center.
- Re-estimate the k cluster centers, by assuming the memberships found above are correct.
- If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
- Randomly select the first centroid from the data points.
- For each data point compute its distance from the nearest, previously chosen centroid.
- Select the next centroid from the data points such that the probability of choosing a point as centroid is directly proportional to its distance from the nearest, previously chosen centroid. (i.e. the point having maximum distance from the nearest centroid is most likely to be selected next as a centroid)
- Repeat steps 2 and 3 until k centroids have been sampled
- Valarray
- Move semantics
- References
- Vectorization:
Our implementation’s speed is approximately: 43 seconds (on my machine).
This is worse than the python implementation because we didnt use AVX instructions so we could not achive NumPy level of optimization.