KMeans Optimization: Incorporating Weights into Re-Clustering Process #1001

Open
ChiaTrama opened this issue Oct 3, 2024 · 1 comment

@ChiaTrama

Hello,

As discussed in this topic on Dask's forum, my colleague and I compared the dask-ml implementation of the KMeans class with our own implementation in a distributed environment. During the comparison, we observed that the dask-ml initialization does not appear to use weights during the centroid re-clustering phase.

The current dask-ml KMeans implementation uses the standard, unweighted KMeans algorithm for centroid re-clustering. In contrast, our implementation incorporates weights in two places:

  • KMeans++ initialization.
  • Weighted average during centroid re-clustering.
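
For illustration only, the two points above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not code taken from dask-ml or from the linked repository: the function names `weighted_kmeans_pp` and `weighted_recluster` are hypothetical, and each candidate centroid is assumed to carry a positive weight (for example, the number of data points closest to it).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans_pp(candidates, weights, k, rng):
    """Hypothetical helper: KMeans++ seeding that takes weights into account.

    Each new seed is sampled with probability proportional to
    weight * squared distance to the nearest seed chosen so far.
    """
    # First seed: sampled proportionally to the weights alone.
    centers = [list(rng.choices(candidates, weights=weights, k=1)[0])]
    d2 = [sq_dist(c, centers[0]) for c in candidates]
    while len(centers) < k:
        scores = [w * d for w, d in zip(weights, d2)]
        if sum(scores) == 0:
            # Every candidate coincides with a seed; fall back to the weights.
            scores = weights
        new = rng.choices(candidates, weights=scores, k=1)[0]
        centers.append(list(new))
        d2 = [min(d, sq_dist(c, new)) for d, c in zip(d2, candidates)]
    return centers


def weighted_recluster(candidates, weights, k, n_iter=20, seed=0):
    """Hypothetical helper: reduce weighted candidates to k centroids.

    Runs Lloyd iterations in which each centroid is updated to the
    weighted average of the candidates assigned to it.
    """
    rng = random.Random(seed)
    centers = weighted_kmeans_pp(candidates, weights, k, rng)
    dim = len(candidates[0])
    for _ in range(n_iter):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for point, w in zip(candidates, weights):
            # Assign the candidate to its nearest current centroid.
            j = min(range(k), key=lambda i: sq_dist(point, centers[i]))
            totals[j] += w
            for d, x in enumerate(point):
                sums[j][d] += w * x
        for j in range(k):
            # Keep the previous centroid if nothing was assigned to it.
            if totals[j] > 0:
                centers[j] = [s / totals[j] for s in sums[j]]
    return centers
```

As a small usage example, calling `weighted_recluster([(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)], [3.0, 1.0, 1.0, 3.0], k=2)` yields centroids at (0.25, 0.0) and (10.75, 10.0), whereas an unweighted average of the same two groups would give (0.5, 0.0) and (10.5, 10.0). The heavier candidates pull each centroid toward themselves.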

Although our implementation is slower than dask-ml in terms of execution time, we achieved better results when clustering a blob dataset. This is likely because the weighting reduced the number of clustering iterations, rather than because of any direct code optimization.

If you're interested, feel free to review our repository for further details on our approach:
GitHub Repository.

Thank you for considering this issue.

Best regards,
Chiara

@TomAugspurger
Member

Thanks for sharing!
