Excluding matches based on an identifier #2208

jericksonclinicaloptions · 2024-06-12T16:24:52Z

Hello,

I have been using Splink successfully, but have come across a problem I am struggling with...

I have data coming from various sources, which I have been matching by email address and by NPI number, which is unique to each individual. I am trying to increase the number of matches by adding a blocking rule on first name, last name, and profession. My problem is overmatching: records are being linked even though their NPI numbers show they should not match.

In this sample, the records in yellow should not be matched, based on the npi_number.

I have tried the following code, which seemed to work on my training dataset (100k records) but is still overmatching on my full dataset (8 million records):

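import mlflow
import splink.spark.comparison_library as cl
import splink.spark.comparison_template_library as ctl
from splink.spark.blocking_rule_library import block_on
from splink.spark.linker import SparkLinker

run_threshold = 0.9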

run_comparisons = [
        ctl.name_comparison("first_name", term_frequency_adjustments=True),
        ctl.name_comparison("last_name", term_frequency_adjustments=True),
        ctl.email_comparison("primary_email_address"),
        cl.exact_match("npi_number", m_probability_exact_match=1, m_probability_else=0.0000001), # force a positive and negative match weight, to ensure we don't match any records with different NPI numbers
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.exact_match("state_iso_code", term_frequency_adjustments=True),
        cl.exact_match("country_iso_code", term_frequency_adjustments=True),
        ctl.postcode_comparison("postal_code"),
        cl.exact_match("primary_profession", term_frequency_adjustments=True)
    ]

# skip estimating where we have defaults
columns_to_skip_estimates = [
    "npi_number"
]

# This is critical for performance, to reduce the number of pairwise comparisons
# https://moj-analytical-services.github.io/splink/topic_guides/blocking/predictions.html
run_prediction_blocking_rules = [
        block_on("npi_number"),
        block_on("primary_email_address"),
        block_on(["first_name", "last_name", "primary_profession"]),
        block_on(["first_name", "last_name", "substring(postal_code, 1, 5)"]) # this doesn't add any comparisons at 100,000 records...
    ]

run_deterministic_rules = [
    "lower(l.first_name) = lower(r.first_name) and lower(l.last_name) = lower(r.last_name)",
    "l.primary_email_address = r.primary_email_address",
    "l.npi_number = r.npi_number"
]

# Create dataset
table_path = f"{DATABASE_PATH}.{TRAINING_TABLE_NAME}"
df_training = spark.table(table_path)

# Log dataset
ds_training = mlflow.data.from_spark(df_training, path=table_path)
mlflow.log_input(ds_training, context='training')
     
# https://moj-analytical-services.github.io/splink/demos/tutorials/03_Blocking.html
settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": UNIQUE_ID_COLUMN_NAME,
    "comparisons": run_comparisons,
    "blocking_rules_to_generate_predictions": run_prediction_blocking_rules,
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "em_convergence": 0.01
}

# Configure the catalog and database for temp tables
linker = SparkLinker(df_training, settings, spark=spark, catalog=CATALOG_NAME, database="splink")

# Estimate the probability that two random records match
linker.estimate_probability_two_random_records_match(run_deterministic_rules, recall=0.6)

# Estimate u with random sampling
linker.estimate_u_using_random_sampling(max_pairs=1e7)

# Training blocking rules - see https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html
# Unlike Prediction Rules, it does not matter if Training Rules excludes some true matches - it just needs to generate examples of matches and non-matches.
training_blocking_rule = "substr(l.first_name, 1,1) = substr(r.first_name, 1,1) and l.last_name = r.last_name"
print(f"Comparisons: {linker.count_num_comparisons_from_blocking_rule(training_blocking_rule)}")
linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule, comparisons_to_deactivate=columns_to_skip_estimates)

training_blocking_rule = "l.postal_code = r.postal_code and l.primary_profession = r.primary_profession"
print(f"Comparisons: {linker.count_num_comparisons_from_blocking_rule(training_blocking_rule)}")
linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule, comparisons_to_deactivate=columns_to_skip_estimates)

training_blocking_rule = "l.primary_email_address = r.primary_email_address"
print(f"Comparisons: {linker.count_num_comparisons_from_blocking_rule(training_blocking_rule)}")
linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule, comparisons_to_deactivate=columns_to_skip_estimates)

training_blocking_rule = "l.npi_number = r.npi_number"
print(f"Comparisons: {linker.count_num_comparisons_from_blocking_rule(training_blocking_rule)}")
linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule, comparisons_to_deactivate=columns_to_skip_estimates)

training_blocking_rule = "l.postal_code = r.postal_code"
print(f"Comparisons: {linker.count_num_comparisons_from_blocking_rule(training_blocking_rule)}")
linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule, comparisons_to_deactivate=columns_to_skip_estimates)

# Get the run id
run_name = run.info.run_name.replace("-", "_")

# Generate predictions
predictions = linker.predict(threshold_match_probability=run_threshold)
df_predictions = predictions.as_spark_dataframe()
#df_predictions.write.mode("overwrite").saveAsTable(f"{table_path}_prediction_{run_name}")

# Generate clusters
clusters = linker.cluster_pairwise_predictions_at_threshold(predictions, run_threshold)
df_clusters = clusters.as_spark_dataframe()
df_clusters.write.mode("overwrite").saveAsTable(f"{table_path}_cluster_{run_name}")

Here is the blocking chart:

Here is the weight chart:

Any feedback is appreciated!

Jeff Erickson

RobinL (Maintainer) · 2024-06-14T12:27:47Z

If I understand the question correctly, I think you have two options.

Option 1: write your blocking rules to filter out the record comparisons you don't want:

import pandas as pd
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.linker import DuckDBLinker

records = [
    {"unique_id": 1, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 2, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 3, "fname": "marylyn", "surname": "monroe", "npi": 2},
    {"unique_id": 4, "fname": "marylyn", "surname": "monroe", "npi": None},
]
df = pd.DataFrame(records)
settings = {
    "link_type": "dedupe_only",
    "probability_two_random_records_match": 0.5,
    "comparisons": [
        cl.exact_match("fname"),
        cl.exact_match("surname"),
        cl.exact_match("npi"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.fname = r.fname and (l.npi = r.npi or l.npi is null or r.npi is null)",
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}


DuckDBLinker(df, settings).predict().as_pandas_dataframe()

Option 2: use a post-linking filter:

import pandas as pd

import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.linker import DuckDBLinker

records = [
    {"unique_id": 1, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 2, "fname": "marylyn", "surname": "monroe", "npi": 1},
    {"unique_id": 3, "fname": "marylyn", "surname": "monroe", "npi": 2},
    {"unique_id": 4, "fname": "marylyn", "surname": "monroe", "npi": None},
]
df = pd.DataFrame(records)
settings = {
    "link_type": "dedupe_only",
    "probability_two_random_records_match": 0.5,
    "comparisons": [
        cl.exact_match("fname"),
        cl.exact_match("surname"),
        cl.exact_match("npi"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.fname = r.fname",
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings)
predictions = linker.predict()

sql = f"""
select *
from {predictions.physical_name}
where 
(npi_l = npi_r or npi_l is null or npi_r is null) 
"""

predictions_filtered = linker.query_sql(sql, output_type="splinkdf")
predictions_filtered.as_pandas_dataframe()
linker.cluster_pairwise_predictions_at_threshold(
    predictions_filtered, threshold_match_probability=0.5
).as_pandas_dataframe()

However, you can see there is a potential issue with clustering here: since npi is null for unique_id 4, you get a transitive match between e.g. unique_id 2 and unique_id 3 via unique_id 4.

There's not much you can easily do about that, because there's nothing wrong with the logic per se. Both methods 1 and 2 have this problem.

Note that to use the first option you'd need to add the additional (l.npi = r.npi or l.npi is null or r.npi is null) condition to all blocking rules. This is probably my recommended approach.
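
Adding the same condition to every blocking rule by hand is easy to get wrong, so here is a minimal sketch of one way to do it. The helper name `with_guard` is hypothetical and not part of Splink, and it assumes the rules are plain SQL strings like the ones in the toy example (not `block_on(...)` objects) and that the column is called `npi`.

```python
# Hypothetical helper (not a Splink function): AND an identifier guard
# onto every SQL blocking rule so mismatching NPIs are never compared.
NPI_GUARD = "(l.npi = r.npi or l.npi is null or r.npi is null)"


def with_guard(blocking_rules, guard=NPI_GUARD):
    """Return copies of the SQL blocking rules, each AND-ed with the guard."""
    # Wrap the original rule in parentheses so an "or" inside it
    # cannot change the meaning of the combined condition.
    return [f"({rule}) and {guard}" for rule in blocking_rules]


rules = [
    "l.fname = r.fname",
    "l.surname = r.surname",
]
guarded_rules = with_guard(rules)
for rule in guarded_rules:
    print(rule)
```

The resulting list could then be passed as "blocking_rules_to_generate_predictions" in the settings dictionary.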

2 replies

jericksonclinicaloptions Jun 18, 2024
Author

Thanks for the response.

I am now wondering whether what I need is deterministic matching at the column level, overriding any probabilistic matching. An example would be a US Social Security number: since they are effectively unique, two records with the same value almost certainly indicate the same individual, and two records with different values indicate different individuals (assuming the data is accurate). Transitive matches should not result in different SSNs in the same cluster.

Have you considered being able to specify deterministic columns in probabilistic matching mode, or is this something Splink can currently do?

RobinL Jun 18, 2024
Maintainer

In terms of deterministic columns, you have two options:

1. Use a post-linking query with e.g. a case statement to override the Splink model predictions.
2. Manually adjust the match weights on ssn post-training, making them effectively infinitely strong for a match and infinitely negative for a non-match.

There's an example of option 2 here
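
As a rough illustration of option 1, the sketch below applies a case statement to a small made-up predictions table. It uses Python's built-in sqlite3 only so that it runs on its own; the table name, the `ssn_l` / `ssn_r` columns, and the probabilities are all invented for the example. With Splink, a similar query could presumably be run against the predictions table through `linker.query_sql`, as in the post-linking filter shown earlier in the thread.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table predictions "
    "(unique_id_l, unique_id_r, ssn_l, ssn_r, match_probability)"
)
con.executemany(
    "insert into predictions values (?, ?, ?, ?, ?)",
    [
        (1, 2, "111", "111", 0.40),  # same SSN, model is unsure
        (1, 3, "111", "222", 0.95),  # different SSN, model is confident
        (1, 4, "111", None, 0.80),   # SSN missing on one side
    ],
)

# Equal SSNs force a match, different SSNs force a non-match.
# If either SSN is null, both comparisons evaluate to null, so the
# "else" branch keeps the model's own probability.
override_sql = """
select
    unique_id_l,
    unique_id_r,
    case
        when ssn_l = ssn_r then 1.0
        when ssn_l <> ssn_r then 0.0
        else match_probability
    end as match_probability
from predictions
order by unique_id_r
"""
overridden = con.execute(override_sql).fetchall()
for row in overridden:
    print(row)
```

Note that this only overrides pairwise scores. It does not stop two different SSNs from ending up in the same cluster through a third record.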

There's no easy way of solving this problem:

"Transitive matches should not result in different SSNs in the same cluster."

It's really a graph theory problem, and a hard one, even in the simple case you describe. It is harder still for larger clusters, where the transitivity may be more complex.
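
To make the transitivity issue concrete, here is a small sketch that clusters the four toy records from earlier in the thread and then flags any cluster containing more than one distinct identifier. It is plain Python with a hand-written union-find rather than Splink's own clustering, and the list of links is an assumption about which pairs would survive the NPI filter. It only detects conflicting clusters; deciding how to split them is the hard part described above.

```python
from collections import defaultdict

# Records from the toy example: unique_id -> npi.
npi_by_id = {1: 1, 2: 1, 3: 2, 4: None}

# Assumed pairwise links that pass the NPI filter
# (equal NPIs, or NPI missing on one side).
links = [(1, 2), (1, 4), (2, 4), (3, 4)]


def connected_components(nodes, edges):
    """Group nodes into clusters using a small union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return list(clusters.values())


clusters = connected_components(npi_by_id, links)

# Flag clusters that contain more than one distinct non-null NPI.
conflicting = [
    c for c in clusters
    if len({npi_by_id[i] for i in c if npi_by_id[i] is not None}) > 1
]
print(conflicting)
```

All four records land in a single cluster because record 4, which has no NPI, links to both groups, so the cluster contains NPI 1 and NPI 2 even though no direct link joins them.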

vfrank66 · 2024-06-22T02:35:47Z

One other problem is that, during training, you are telling the model that those overmatching records are true matches. The first of your deterministic rules treats any pair that shares a first and last name as a match:

run_deterministic_rules = [
    "lower(l.first_name) = lower(r.first_name) and lower(l.last_name) = lower(r.last_name)",
    "l.primary_email_address = r.primary_email_address",
    "l.npi_number = r.npi_number"
]

But some of the pairs it then treats as matches are not actually correct.
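
A back-of-the-envelope sketch of why a loose deterministic rule matters: as I understand it, the prior is estimated roughly as the number of pairs the rules find, scaled up by the assumed recall, divided by the total number of possible pairs. The function below is my own approximation of that arithmetic, not Splink's implementation, and the pair counts are invented.

```python
def estimated_prior(pairs_found_by_rules, recall, n_records):
    """Approximate P(two random records match) for a dedupe job.

    Assumed formula: (pairs found / recall) / total possible pairs.
    """
    total_pairs = n_records * (n_records - 1) / 2
    return (pairs_found_by_rules / recall) / total_pairs


n = 100_000  # size of the training dataset mentioned in the question

# Invented counts: rules that only find genuine matches versus rules
# that also sweep in different people who share a name.
tight = estimated_prior(20_000, recall=0.6, n_records=n)
loose = estimated_prior(60_000, recall=0.6, n_records=n)

print(f"tight rules: {tight:.3e}")
print(f"loose rules: {loose:.3e}")
print(f"inflation factor: {loose / tight:.1f}")
```

Under these assumptions, every false pair admitted by the name-only rule raises the prior in direct proportion, which pushes all match probabilities upwards.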
