Add kmeans clustering based on ray #1080
base: main
Conversation
Don't we want the result to include cluster membership? In the summarization case, we could use kmeans/clustering for topic discovery, select a few elements from each cluster, and use those to summarize. What I would need is membership info.
Technically, this is k-means, and I am talking about clustering.
The GitHub comment feature seems not to work; your comment could not be replied to in the dialogue. Anyway, the PR already shows how the clustering method works. Basically, you do a similar trick: assign each row a cluster id with a map function by comparing it against the centroids, and then either sort or aggregate for your purpose. In your summarization case, you might chain an aggregate function to select N entries for each group, as sketched below.
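A minimal sketch of that trick, assuming `assigned` is a Ray Dataset whose rows already carry a "cluster" id from the map step; `select_n` is a hypothetical helper, not part of the PR:

```python
import pandas as pd

def select_n(group: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    # Keep the first n members of each cluster as its representatives.
    return group.head(n)

representatives = (
    assigned.groupby("cluster")
    .map_groups(lambda g: select_n(g, n=3), batch_format="pandas")
    .take_all()
)
# `representatives` now holds a few rows per topic cluster, ready to feed
# into a summarization prompt.
```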
Can you fix linting and unit tests?
Any idea how this performs? I don't need a full perf analysis; just curious if it feels reasonable on 100 data points, or 1000?
lib/sycamore/sycamore/docset.py
```
@@ -903,6 +906,28 @@ def map(self, f: Callable[[Document], Document], **resource_args) -> "DocSet":
        mapping = Map(self.plan, f=f, **resource_args)
        return DocSet(self.context, mapping)

    def kmeans(self, K: int, iterations: int, init_mode: str = "random", epsilon: float = 1e-4):
```
I get that it's different, but I do think we should have a method that returns a DocSet with an extra field indicating the cluster that each row is assigned to. I know that it is just an extra map, but I expect it's how people would want to use it in practice, and it makes things more chainable.
Yes, sure, I would add one.
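For illustration, one hypothetical shape such a chainable method could take; the name `assign_clusters`, the `centroids` argument, and the use of `doc.embedding` and `doc.properties` are assumptions here, not the PR's actual API:

```python
import numpy as np

def assign_clusters(self, centroids) -> "DocSet":
    # Hypothetical chainable method: returns a new DocSet in which every
    # document carries the index of its nearest centroid as a property.
    def tag(doc: Document) -> Document:
        dists = [np.linalg.norm(np.asarray(doc.embedding) - c) for c in centroids]
        doc.properties["cluster"] = int(np.argmin(dists))
        return doc

    return self.map(tag)
```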
```
        Args:
            K: the number of centroids
            iterations: the maximum number of iterations to run before convergence
```
Is there a reasonable default for this? I at least wouldn't know what a good value to pick would be.
Spark uses 20; we could follow the same, but it should really be a tuning process.
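For concreteness, a hypothetical call following the signature in the diff above; the values are illustrative, not recommendations:

```python
# Assumes `docset` is a DocSet whose documents already carry embeddings.
# K=10 and iterations=20 (Spark's default) are arbitrary example values.
centroids = docset.kmeans(K=10, iterations=20)
```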
How hard would it be to add a local mode version of this? I guess it would require a local-mode aggregate. That would be convenient, though I confess I'm not sure how important it is.
Yes, would fix.
Would try to come up with one.
This generally includes three steps: 1. materialize each document's embedding; 2. initialize centroids randomly; 3. iterate the k-means process until convergence, based on the Ray Dataset map-groups and aggregate operators. The resulting centroids can be used for downstream work. A sketch of these steps follows.
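A rough, self-contained sketch of steps 2 and 3 on a plain Ray Dataset (step 1, materializing the embeddings, is assumed to have happened upstream); the helper names and the "embedding" column are illustrative and not taken from the PR:

```python
import numpy as np
import pandas as pd
import ray

def assign_cluster(row: dict, centroids: list) -> dict:
    # Tag the row with the index of its nearest centroid (Euclidean distance).
    dists = [np.linalg.norm(np.asarray(row["embedding"]) - c) for c in centroids]
    row["cluster"] = int(np.argmin(dists))
    return row

def recenter(group: pd.DataFrame) -> pd.DataFrame:
    # Collapse one cluster's members into its new centroid (the member mean).
    mean = np.stack(group["embedding"].to_numpy()).mean(axis=0)
    return pd.DataFrame({"cluster": [int(group["cluster"].iloc[0])], "centroid": [mean]})

def kmeans(ds: "ray.data.Dataset", K: int, iterations: int = 20, epsilon: float = 1e-4):
    # Step 2 (init_mode="random"): seed the centroids from the first K rows.
    centroids = [np.asarray(r["embedding"]) for r in ds.take(K)]
    for _ in range(iterations):
        # Step 3a (map): tag every row with the id of its nearest centroid.
        assigned = ds.map(lambda row, c=centroids: assign_cluster(row, c))
        # Step 3b (group + aggregate): recompute each centroid as the mean
        # of its members.
        rows = (
            assigned.groupby("cluster")
            .map_groups(recenter, batch_format="pandas")
            .take_all()
        )
        new = list(centroids)  # clusters that lost all members keep their old centroid
        for r in rows:
            new[r["cluster"]] = np.asarray(r["centroid"])
        shift = max(np.linalg.norm(a - b) for a, b in zip(centroids, new))
        centroids = new
        if shift < epsilon:  # converged: no centroid moved more than epsilon
            break
    return centroids
```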