General approach to dealing with growing dataset #2464
-
Thanks for the question. As you can probably guess, there are no perfect solutions here. Fundamentally, I don't believe you can have:
There is a special case where you only ever receive new records, and old records don't get deleted or updated. In that case, new records should only ever join existing clusters, and stable cluster IDs are possible. But even here, we have to assume we don't occasionally retrain the model, which conflicts with (2). With all that said, I would offer the following practical guidance. This is very much just my current thinking, so don't take it as gospel:
Thinking specifically about IDs, one way to approach this is for your service to run using the original row IDs (i.e. the pre-link/pre-dedupe IDs). Those IDs continue to be used to refer to individual records, and the clustering/dedupe service simply returns the group of IDs in the cluster. In this setup, the cluster never has a 'meaningful' ID that is used by the business. But obviously this solution is not suitable in all contexts.

Hope some of that helps - I'm interested in your thoughts and ideas because I'm not really sure about any of this.
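A minimal sketch of that pattern, assuming the clustering output is available as plain `(row_id, cluster_label)` pairs; the `build_membership_lookup` helper and the label values are hypothetical, not part of Splink:

```python
from collections import defaultdict

def build_membership_lookup(cluster_assignments):
    """Return a dict mapping each original row id to the frozenset of row ids
    that currently sit in the same cluster (including itself)."""
    by_label = defaultdict(set)
    for row_id, label in cluster_assignments:
        by_label[label].add(row_id)

    # Consumers only ever see groups of original row ids, never the internal labels.
    return {row_id: frozenset(members)
            for members in by_label.values()
            for row_id in members}

# External systems keep referring to stable row ids such as "r1"; after each
# re-clustering the lookup is rebuilt and the group it returns may change,
# but the ids inside it are always the original, pre-dedupe row ids.
lookup = build_membership_lookup([("r1", 1), ("r2", 1), ("r3", 2)])
assert lookup["r1"] == frozenset({"r1", "r2"})
```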
-
Sanity check this statement: for datasets that change in size over time, the happy path is to build and train the model once and then only ever add new records to existing clusters. Optionally one can occasionally re-cluster and deal with the downstream merges and splits. However, any consumers of these ids need to be aware that they can change.
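To make that happy path concrete, here is a library-agnostic sketch rather than Splink's actual API: it assumes you already have pairwise match scores between new and existing records, and the `assign_new_records` helper and its parameters are hypothetical:

```python
from itertools import count

def assign_new_records(existing, scored_pairs, threshold=0.9):
    """existing:     dict of row_id -> cluster_id from the original run
    scored_pairs: list of (new_row_id, existing_row_id, match_probability) triples
    Returns a combined dict of row_id -> cluster_id with the old ids untouched."""
    next_cluster = count(start=max(existing.values(), default=0) + 1)
    assignments = dict(existing)

    # For each new record, keep only its best-scoring existing match above the threshold.
    best = {}
    for new_id, old_id, prob in scored_pairs:
        if prob >= threshold and prob > best.get(new_id, ("", 0.0))[1]:
            best[new_id] = (old_id, prob)

    for new_id in {pair[0] for pair in scored_pairs}:
        if new_id in best:
            assignments[new_id] = existing[best[new_id][0]]   # join an existing cluster
        else:
            assignments[new_id] = next(next_cluster)          # start a fresh cluster
    return assignments
```

Note that matches between two brand-new records aren't handled here, which is one of the ways the happy path quietly breaks down.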
I was getting excited about my new beautiful clusters until I had to think about adding new records to my large dataset and then it all started falling apart.
Essentially it seems that incremental clustering is hard: `transitivity`, `bridges` and managing `cluster_ids` are hard to deal with. High level, the problem is as follows: `transitivity`, `bridges` and my ids all change. e.g.:
T=0
Model gives records `(r1, r2, r3)` as `cluster_id = 1` and records `(r4, r5, r6)` as `cluster_id = 2`.

T=1
Add new records `(r7, r8, r9)`, add more training labels, retrain the model, and re-run clustering. Model now gives `(r1, r2)` `cluster_id = 1`, `(r3, r4, r5, r6, r7)` `cluster_id = 2`, and `(r8, r9)` with `cluster_id = 3`.
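One way to see this churn, sketched below under the assumption that each run's output fits in a plain dict of `row_id -> cluster_id` (the `cluster_crosswalk` helper is hypothetical), is to cross-tabulate the two runs:

```python
from collections import defaultdict

def cluster_crosswalk(old, new):
    """Map each new cluster id to {old_cluster_id: shared_record_count}."""
    crosswalk = defaultdict(lambda: defaultdict(int))
    for row_id, new_cid in new.items():
        old_cid = old.get(row_id)          # None for genuinely new records
        crosswalk[new_cid][old_cid] += 1
    return {k: dict(v) for k, v in crosswalk.items()}

t0 = {"r1": 1, "r2": 1, "r3": 1, "r4": 2, "r5": 2, "r6": 2}
t1 = {"r1": 1, "r2": 1, "r3": 2, "r4": 2, "r5": 2, "r6": 2, "r7": 2, "r8": 3, "r9": 3}

# New cluster 2 now mixes members of old clusters 1 and 2 plus an unseen record,
# which is exactly why the old ids cannot simply be carried forward.
print(cluster_crosswalk(t0, t1))
# {1: {1: 2}, 2: {1: 1, 2: 3, None: 1}, 3: {None: 2}}
```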
So basically, after adding new data the cluster compositions completely change. Even the ids change: e.g. I can't rely on `2` still having some of the same elements `2` used to have, because a new cluster may have formed or `2` may have joined another. This makes it really hard to add data without completely "re-doing" everything and having to toss the `ids` from previous clustering runs. With applications with external `ids` you can't really do that. What if I have external applications referencing my `cluster_ids`? Even if I could update these, they won't get the updates. What if external links exist such as `https://app.mycompany.com/clusters/id`? Now they are invalid. Even internally it creates challenges and headaches, e.g. these `cluster_ids` are foreign keys in 10 other tables. Managing the migration of these linkages is non-trivial. If you want to re-train or re-run clustering then you have to do the following.
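One mitigation some teams use, and not something Splink provides out of the box, is to keep a persistent entity id in a crosswalk table and re-point it after every run to the new cluster that inherits most of its members. The sketch below assumes dict inputs and a hypothetical `repoint_entity_ids` helper:

```python
from collections import Counter, defaultdict
from uuid import uuid4

def repoint_entity_ids(old, new, entity_of_old_cluster):
    """old / new:             dicts of row_id -> cluster_id for consecutive runs
    entity_of_old_cluster: dict of old cluster_id -> persistent entity id
    Returns a dict of new cluster_id -> persistent entity id."""
    members_of_new = defaultdict(list)
    for row_id, new_cid in new.items():
        members_of_new[new_cid].append(row_id)

    entity_of_new_cluster = {}
    for new_cid, members in members_of_new.items():
        # Which old cluster contributes the most records to this new cluster?
        overlap = Counter(old[r] for r in members if r in old)
        if overlap:
            dominant_old_cid = overlap.most_common(1)[0][0]
            entity_of_new_cluster[new_cid] = entity_of_old_cluster[dominant_old_cid]
        else:
            entity_of_new_cluster[new_cid] = str(uuid4())  # genuinely new entity
    return entity_of_new_cluster
```

Splits and merges still need an explicit policy (e.g. which fragment keeps the old entity id), so this reduces, rather than removes, the churn that downstream foreign keys see.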
Questions/Problems
Apologies if my questions are overlapping and not coherent, but hopefully this illustrates the challenges/problems. @RobinL, or anyone with experience using Splink in a real-time application, it would be helpful to hear how people manage this / what the best approach is.
Related links
Related discussions