General approach to dealing with growing dataset #2464
-
Thanks for the question. As you can probably guess, there are no perfect solutions here. Fundamentally, I don't believe you can have:
There is a special case where you only ever receive new records, and old records don't get deleted or updated. In that case, new records should only ever join existing clusters, and stable cluster IDs are possible. But even here, we have to assume we don't occasionally retrain the model, which conflicts with (2). With all that said, I would offer the following practical guidance. This is very much just my current thinking, so don't take it as gospel:
Thinking specifically about IDs, one way to approach this is for your service to run using the original row IDs (i.e. the pre-link/pre-dedupe IDs). Those IDs continue to be used to refer to individual records, and the clustering/dedupe service simply returns the group of IDs in the cluster. In this setup, the cluster never has a 'meaningful' ID that is used by the business. But obviously this solution is not suitable in all contexts.

Hope some of that helps - I'm interested in your thoughts and ideas because I'm not really sure about any of this.
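A minimal sketch of that pattern, assuming the clustering output is available as plain `(row_id, cluster_label)` pairs; the `build_membership_lookup` helper and the label values are hypothetical, not part of Splink:

```python
from collections import defaultdict

def build_membership_lookup(cluster_assignments):
    """Return a dict mapping each original row id to the frozenset of row ids
    that currently sit in the same cluster (including itself)."""
    by_label = defaultdict(set)
    for row_id, label in cluster_assignments:
        by_label[label].add(row_id)

    # Consumers only ever see groups of original row ids, never the internal labels.
    return {row_id: frozenset(members)
            for members in by_label.values()
            for row_id in members}

# External systems keep referring to stable row ids such as "r1"; after each
# re-clustering the lookup is rebuilt and the group it returns may change,
# but the ids inside it are always the original, pre-dedupe row ids.
lookup = build_membership_lookup([("r1", 1), ("r2", 1), ("r3", 2)])
assert lookup["r1"] == frozenset({"r1", "r2"})
```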
-
Sanity check this statement: for datasets that change in size over time, the happy path is to build and train the model once and then only ever add new records to existing clusters. Optionally one can occasionally re-cluster and deal with the downstream merges and splits. However, any consumers of these ids need to be aware that they can change.
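To make that happy path concrete, here is a library-agnostic sketch rather than Splink's actual API: it assumes you already have pairwise match scores between new and existing records, and the `assign_new_records` helper and its parameters are hypothetical:

```python
from itertools import count

def assign_new_records(existing, scored_pairs, threshold=0.9):
    """existing:     dict of row_id -> cluster_id from the original run
    scored_pairs: list of (new_row_id, existing_row_id, match_probability) triples
    Returns a combined dict of row_id -> cluster_id with the old ids untouched."""
    next_cluster = count(start=max(existing.values(), default=0) + 1)
    assignments = dict(existing)

    # For each new record, keep only its best-scoring existing match above the threshold.
    best = {}
    for new_id, old_id, prob in scored_pairs:
        if prob >= threshold and prob > best.get(new_id, ("", 0.0))[1]:
            best[new_id] = (old_id, prob)

    for new_id in {pair[0] for pair in scored_pairs}:
        if new_id in best:
            assignments[new_id] = existing[best[new_id][0]]   # join an existing cluster
        else:
            assignments[new_id] = next(next_cluster)          # start a fresh cluster
    return assignments
```

Note that matches between two brand-new records aren't handled here, which is one of the ways the happy path quietly breaks down.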
I was getting excited about my new beautiful clusters until I had to think about adding new records to my large dataset and then it all started falling apart.
Essentially it seems that incremental clustering is hard: `transitivity`, `bridges` and managing `cluster_ids` are hard to deal with. High level, the problem is as follows: `transitivity`, `bridges` and my ids all change. e.g.:
T=0
Model gives records `(r1, r2, r3)` as `cluster_id = 1` and records `(r4, r5, r6)` as `cluster_id = 2`.

T=1
Add new records `(r7, r8, r9)`, add more training labels, retrain the model, and re-run clustering. Model now gives `(r1, r2)` `cluster_id = 1`, `(r3, r4, r5, r6, r7)` `cluster_id = 2`, and `(r8, r9)` with `cluster_id = 3`.
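One way to see this churn, sketched below under the assumption that each run's output fits in a plain dict of `row_id -> cluster_id` (the `cluster_crosswalk` helper is hypothetical), is to cross-tabulate the two runs:

```python
from collections import defaultdict

def cluster_crosswalk(old, new):
    """Map each new cluster id to {old_cluster_id: shared_record_count}."""
    crosswalk = defaultdict(lambda: defaultdict(int))
    for row_id, new_cid in new.items():
        old_cid = old.get(row_id)          # None for genuinely new records
        crosswalk[new_cid][old_cid] += 1
    return {k: dict(v) for k, v in crosswalk.items()}

t0 = {"r1": 1, "r2": 1, "r3": 1, "r4": 2, "r5": 2, "r6": 2}
t1 = {"r1": 1, "r2": 1, "r3": 2, "r4": 2, "r5": 2, "r6": 2, "r7": 2, "r8": 3, "r9": 3}

# New cluster 2 now mixes members of old clusters 1 and 2 plus an unseen record,
# which is exactly why the old ids cannot simply be carried forward.
print(cluster_crosswalk(t0, t1))
# {1: {1: 2}, 2: {1: 1, 2: 3, None: 1}, 3: {None: 2}}
```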
So basically, after adding new data the cluster compositions completely change. Even the ids change: e.g. I can't rely on `2` still having some of the same elements `2` used to have, because a new cluster may have formed or `2` may have joined another. This makes it really hard to add data without completely "re-doing" everything and having to toss the `ids` from previous clustering runs. With applications with external `ids` you can't really do that. What if I have external applications referencing my `cluster_ids`? Even if I could update these, they won't get the updates. What if external links exist such as `https://app.mycompany.com/clusters/id`? Now they are invalid. Even internally it creates challenges and headaches, e.g. these `cluster_ids` are foreign keys in 10 other tables. Managing the migration of these linkages is non-trivial. If you want to re-train or re-run clustering then you have to do the following.
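One mitigation some teams use, and not something Splink provides out of the box, is to keep a persistent entity id in a crosswalk table and re-point it after every run to the new cluster that inherits most of its members. The sketch below assumes dict inputs and a hypothetical `repoint_entity_ids` helper:

```python
from collections import Counter, defaultdict
from uuid import uuid4

def repoint_entity_ids(old, new, entity_of_old_cluster):
    """old / new:             dicts of row_id -> cluster_id for consecutive runs
    entity_of_old_cluster: dict of old cluster_id -> persistent entity id
    Returns a dict of new cluster_id -> persistent entity id."""
    members_of_new = defaultdict(list)
    for row_id, new_cid in new.items():
        members_of_new[new_cid].append(row_id)

    entity_of_new_cluster = {}
    for new_cid, members in members_of_new.items():
        # Which old cluster contributes the most records to this new cluster?
        overlap = Counter(old[r] for r in members if r in old)
        if overlap:
            dominant_old_cid = overlap.most_common(1)[0][0]
            entity_of_new_cluster[new_cid] = entity_of_old_cluster[dominant_old_cid]
        else:
            entity_of_new_cluster[new_cid] = str(uuid4())  # genuinely new entity
    return entity_of_new_cluster
```

Splits and merges still need an explicit policy (e.g. which fragment keeps the old entity id), so this reduces, rather than removes, the churn that downstream foreign keys see.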
Questions/Problems
Apologies if my questions are overlapping and not coherent, but hopefully this illustrates the challenges/problems. @RobinL, or anyone with experience using Splink in a real-time application, it would be helpful to hear how people manage this / what the best approach is.
Related links
Related discussions