Excluding matches based on an identifier #2208
If I understand the question correctly I think you have two options:
import pandas as pd
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.linker import DuckDBLinker

records = [
    {"unique_id": 1, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 2, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 3, "fname": "marylyn", "surname": "monroe", "npi": 2},
    {"unique_id": 4, "fname": "marylyn", "surname": "monroe", "npi": None},
]
df = pd.DataFrame(records)

settings = {
    "link_type": "dedupe_only",
    "probability_two_random_records_match": 0.5,
    "comparisons": [
        cl.exact_match("fname"),
        cl.exact_match("surname"),
        cl.exact_match("npi"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.fname = r.fname and (l.npi = r.npi or l.npi is null or r.npi is null)",
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

DuckDBLinker(df, settings).predict().as_pandas_dataframe()
However, you can see there is a potential issue with clustering here: since npi is null for unique_id=4, you get a transitive match between, e.g., unique_id 2 and unique_id 3 via unique_id 4. There's not much you can easily do about that, because there's nothing wrong with the logic per se. Both methods 1 and 2 have this problem. Note you'd need to add the additional
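To make the transitive-match issue concrete, here is a minimal sketch in plain Python (no Splink involved). It assumes, purely for illustration, that every pair passing the blocking rule is accepted as a match, and it clusters those pairs with a small hand-written union-find. The helper names `passes_blocking` and `find` are hypothetical and not part of Splink.

```python
from itertools import combinations

# Same toy records as in the Splink example above.
records = [
    {"unique_id": 1, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 2, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 3, "fname": "marylyn", "surname": "monroe", "npi": 2},
    {"unique_id": 4, "fname": "marylyn", "surname": "monroe", "npi": None},
]


def passes_blocking(l, r):
    """Hypothetical Python equivalent of the SQL blocking rule above."""
    npi_ok = l["npi"] is None or r["npi"] is None or l["npi"] == r["npi"]
    return l["fname"] == r["fname"] and npi_ok


# Assumption: every blocked pair is treated as a match.
pairs = [
    (l["unique_id"], r["unique_id"])
    for l, r in combinations(records, 2)
    if passes_blocking(l, r)
]

# Minimal union-find to compute connected components of the match graph.
parent = {rec["unique_id"]: rec["unique_id"] for rec in records}


def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x


for a, b in pairs:
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[max(root_a, root_b)] = min(root_a, root_b)

clusters = {}
for rec in records:
    clusters.setdefault(find(rec["unique_id"]), []).append(rec)

# Clusters that contain more than one distinct non-null npi.
conflicts = {}
for root, members in clusters.items():
    npis = sorted({m["npi"] for m in members if m["npi"] is not None})
    if len(npis) > 1:
        conflicts[root] = npis

print(pairs)
print(conflicts)
```

The pair (1, 3) is never compared directly, yet records 1, 2 and 3 land in the same cluster because each is linked to record 4. A check like `conflicts` could be run on real cluster output to flag clusters that mix different NPI values for manual review.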
One other problem is that, in training, you are telling the model that those overmatching records are true positives. But then, when it does consider those matches, they are actually not correct.
Hello,
I have been using Splink successfully, but have come across a problem I am struggling with...
I have data coming from various sources, which I have been matching by email address and NPI number unique to each individual. I am trying to increase the matches by adding a blocking rule to include first name, last name, and profession. My problem is overmatching, when I know records should not match based on the NPI number.
In this sample, the records in yellow should not be matched based on the npi_number
I have tried the following code, which seemed to work on my training dataset (100k records) but is still overmatching on my full dataset (8 million records):
Here is the blocking chart:
Here is the weight chart:
Any feedback is appreciated!
Jeff Erickson