Can't produce any evaluation charts from saved model #2449

AdamChapnik1 · 2024-10-04T17:01:23Z

AdamChapnik1
Oct 4, 2024

Hi!

I'm trying to produce some evaluation charts (like the unlinkables chart) for a Splink model I ran and saved, but instead the code is running for ages and nothing is coming out. I would usually include a reproducible example, but I'm not getting any errors and the same code in fact works for other models I have run — and very quickly — it's just that this model in particular can't produce any evaluation charts.

There doesn't seem to be any substantial difference between the saved Splink models for which I can produce the evaluation charts and for which I cannot.

The code looks like this. It's meant to save all the evaluation charts as JSONs for several saved Splink models that all have separate associated DuckDB files containing the records that were linked. (Across the DuckDB connections, the tables all happen to have the same names because I partitioned a large national dataset of US addresses by state before running the same Splink model training code for each partition).

from splink import DuckDBAPI, Linker

import duckdb as duckdb
import altair as alt

import os
from os import listdir
from os.path import isfile, join

dir = "path/to/several/duckdb/files"
# all the duckdb files are formatted like "[st]_duckdb.db" where "st" is a state abbreviation
db = [f for f in listdir(dir) if isfile(join(dir, f))]

for m in db:
    con = duckdb.connect(f'{dir}/{m}')
    con.execute("SET threads = 8;") 
   
    linker = Linker(
        ['dfA', 'dfB'],
        # "m[:2]" is equivalent to "st" above
        settings = f'path/to/models/{m[:2]}_model.json',
        db_api = DuckDBAPI(con)
                    )

    chart = linker.evaluation.unlinkables_chart()
    chart.save(f'data/figs/{m[:2]}_unlinkables.json')

    chart = linker.visualisations.match_weights_chart()
    chart.save(f'data/figs/{m[:2]}_match_weights.json')

    chart = linker.visualisations.m_u_parameters_chart()
    chart.save(f'data/figs/{m[:2]}_m_u_parameters.json')

    con.close()

I am running this code on a machine with 8 cores and 90G of RAM. Some of the charts are created and saved in less than a few seconds, whereas the ones I'm having trouble with are taking an indefinite amount of time, and I have had this running for hours. To be clear, I know that there aren't any errors being suppressed by the for-loop because I ran the code for the problematic Splink models outside of the for-loop.

My first question is whether there is anything obviously wrong with this code?

My second question is whether I should share the JSON for the saved model that is giving me trouble?

Thank you!

RobinL · 2024-10-07T19:20:52Z

RobinL
Oct 7, 2024
Maintainer

Not sure. The unlinkables chart is the only one that does any calculations against the backend database.

TO debug, the first thing I'd probably do is see whether you still get the error on smaller datasets e.g. if you do

create or replace view dfA_100 as select * from dfA limit 100 and then pass dfA_100 into splink

The other thing I'd look at is turning on logging at a lower level to see the SQL that's causing the problem.

linker = Linker(data, settings, con)

import logging 
logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(1)

Finally - you can probably post your settings here. They will reveal the name of the columns you're using in your script, and how you've set the model up, but the settings.json doesn't contain any actual data from your input datasets

5 replies

AdamChapnik1 Oct 14, 2024
Author

I've tried to debug what I can but I am not making much progress so if it you might be able to make more sense of this I'm dropping here all of the information I have now. I would really appreciate any insights.

Here are the settings:

{'link_type': 'dedupe_only',
 'probability_two_random_records_match': 5.580857250187588e-07,
 'retain_matching_columns': False,
 'retain_intermediate_calculation_columns': False,
 'additional_columns_to_retain': ['addid'],
 'sql_dialect': 'duckdb',
 'linker_uid': 'zw7adf0n',
 'em_convergence': 0.0001,
 'max_iterations': 25,
 'bayes_factor_column_prefix': 'bf_',
 'term_frequency_adjustment_column_prefix': 'tf_',
 'comparison_vector_value_column_prefix': 'gamma_',
 'unique_id_column_name': 'addid',
 'source_dataset_column_name': 'source_dataset',
 'blocking_rules_to_generate_predictions': [{'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity AND l.szip5 = r.szip5',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.szip5 = r.szip5',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND substr(l.szip5,1,4) = substr(r.szip5,1,4)',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity AND l.szip5 = r.szip5',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.szip5 = r.szip5',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND l.scity = r.scity',
   'sql_dialect': 'duckdb'},
  {'blocking_rule': 'l.saddno = r.saddno AND l.szip5 = r.szip5',
   'sql_dialect': 'duckdb'}],
 'comparisons': [{'output_column_name': 'street',
   'comparison_levels': [{'sql_condition': '(saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL)',
     'label_for_charts': 'Null',
     'is_null_level': True},
    {'sql_condition': '(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
     'label_for_charts': 'Exact match (saddstr) and Intersect (geohash 7)',
     'm_probability': 0.5184678962113823,
     'u_probability': 1.0436458272744132e-05},
    {'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
     'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.92 and Intersect (geohash 7)',
     'm_probability': 0.011895738173284133,
     'u_probability': 5.523632562573723e-07},
    {'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
     'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.88 and Intersect (geohash 7)',
     'm_probability': 0.019878517576250577,
     'u_probability': 6.314759290890734e-07},
    {'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
     'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.84 and Intersect (geohash 7)',
     'm_probability': 0.018633469905727222,
     'u_probability': 4.3832697109455996e-07},
    {'sql_condition': '(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
     'label_for_charts': 'Exact match (saddstr)',
     'm_probability': 0.3320169362895465,
     'u_probability': 0.0001493148354314929},
    {'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
     'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.92',
     'm_probability': 0.019878741014857884,
     'u_probability': 8.614514799233525e-05},
    {'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
     'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.88',
     'm_probability': 0.03469364631438866,
     'u_probability': 0.0003068737815534748},
    {'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
     'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.84',
     'm_probability': 0.04372303645073314,
     'u_probability': 0.0005098384127933774},
    {'sql_condition': 'array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)',
     'label_for_charts': 'Intersect (geohash 8)',
     'm_probability': 1.7211818229633315e-06,
     'u_probability': 1.211635529854881e-07},
    {'sql_condition': 'array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)',
     'label_for_charts': 'Intersect (geohash 7)',
     'm_probability': 0.00014098268683681274,
     'u_probability': 2.168827598440237e-06},
    {'sql_condition': 'ELSE',
     'label_for_charts': 'All other comparisons',
     'm_probability': 0.0006693141951697755,
     'u_probability': 0.9989334792066487}],
   'comparison_description': 'CustomComparison'},
  {'output_column_name': 'scity',
   'comparison_levels': [{'sql_condition': '"scity_l" IS NULL OR "scity_r" IS NULL',
     'label_for_charts': 'scity is NULL',
     'is_null_level': True},
    {'sql_condition': '"scity_l" = "scity_r"',
     'label_for_charts': 'Exact match on scity',
     'm_probability': 0.95787816054219,
     'u_probability': 0.04709747703972181,
     'tf_adjustment_column': 'scity',
     'tf_adjustment_weight': 1.0},
    {'sql_condition': 'ELSE',
     'label_for_charts': 'All other comparisons',
     'm_probability': 0.04212183945781009,
     'u_probability': 0.9529025229602782}],
   'comparison_description': 'ExactMatch'},
  {'output_column_name': 'szip5',
   'comparison_levels': [{'sql_condition': 'szip5_l IS NULL OR szip5_r IS NULL',
     'label_for_charts': 'Null',
     'is_null_level': True},
    {'sql_condition': 'szip5_l = szip5_r',
     'label_for_charts': 'Exact match',
     'm_probability': 0.9807581582000615,
     'u_probability': 0.003993214383316761,
     'tf_adjustment_column': 'szip5',
     'tf_minimum_u_value': 0.001,
     'tf_adjustment_weight': 1.0},
    {'sql_condition': 'substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4)',
     'label_for_charts': 'Exact match (zip4)',
     'm_probability': 0.005447918971943777,
     'u_probability': 0.01455918102155314,
     'tf_adjustment_column': 'szip5',
     'tf_minimum_u_value': 0.001,
     'tf_adjustment_weight': 0.5},
    {'sql_condition': 'substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3)',
     'label_for_charts': 'Exact match (zip3)',
     'm_probability': 0.013793922827954863,
     'u_probability': 0.058049490891291536,
     'tf_adjustment_column': 'szip5',
     'tf_minimum_u_value': 0.001,
     'tf_adjustment_weight': 0.5},
    {'sql_condition': 'damerau_levenshtein(szip5_l, szip5_r) <= 2',
     'label_for_charts': 'Damerau-Levenshtein <= 2',
     'm_probability': 3.460899719747275e-14,
     'u_probability': 0.11638742210989385,
     'tf_adjustment_column': 'szip5',
     'tf_minimum_u_value': 0.001,
     'tf_adjustment_weight': 0.5},
    {'sql_condition': 'ELSE',
     'label_for_charts': 'All other comparisons',
     'm_probability': 5.30903997609798e-15,
     'u_probability': 0.8070106915939447}],
   'comparison_description': 'CustomComparison'}]}

I am able to reduce one of the tables (source dataset 'r') to the first 100 rows, change the settings to dedupe_only, and successfully create the unlinkables chart.

Here is the logging of that successful SQL query

SQL pipeline was passed inputs [] and output dataset __splink__df_concat_with_tf
    Pipeline part 1: CTE reads tables [ regrid_sample ] and has output table name: __splink__df_concat
    Pipeline part 2: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_szip5
    Pipeline part 3: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_scity
    Pipeline part 4: CTE reads tables [ __splink__df_tf_scity, __splink__df_tf_szip5, __splink__df_concat ] and has output table name: __splink__df_concat_with_tf
Executing sql to create __splink__df_concat_with_tf as physical name __splink__df_concat_with_tf_4b5d5ed6b

------Start SQL---------
CREATE TABLE __splink__df_concat_with_tf_4b5d5ed6b AS 
WITH

__splink__df_concat as (
            select "addid", "saddno", "saddpref", "saddstr", "saddsttyp", "saddstsuf", "sunit", "scity", "szip5", "accuracy", "geohash_self_and_neighbors8", "geohash_self_and_neighbors7", "source"
            , random() as __splink_salt
            from regrid_sample
            ), 

__splink__df_tf_szip5 as (
    select
    "szip5", cast(count(*) as float8) / (select
        count("szip5") as total from __splink__df_concat)
            as "tf_szip5"
    from __splink__df_concat
    where "szip5" is not null
    group by "szip5"
    ), 

__splink__df_tf_scity as (
    select
    "scity", cast(count(*) as float8) / (select
        count("scity") as total from __splink__df_concat)
            as "tf_scity"
    from __splink__df_concat
    where "scity" is not null
    group by "scity"
    ) 

    select __splink__df_concat.*, __splink__df_tf_szip5."tf_szip5", __splink__df_tf_scity."tf_scity"
    from __splink__df_concat
    left join __splink__df_tf_szip5 on __splink__df_concat."szip5" = __splink__df_tf_szip5."szip5" left join __splink__df_tf_scity on __splink__df_concat."scity" = __splink__df_tf_scity."scity"
    
-------End SQL-----------

Setting cache for __splink__df_concat_with_tf_4b5d5ed6b with physical name __splink__df_concat_with_tf_4b5d5ed6b
Setting cache for __splink__df_concat_with_tf with physical name __splink__df_concat_with_tf_4b5d5ed6b
SQL pipeline was passed inputs [__splink__df_concat_with_tf_4b5d5ed6b] and output dataset __splink__blocked_id_pairs
    Pipeline part 1: CTE reads tables [ __splink__df_concat_with_tf_4b5d5ed6b ] and has output table name: __splink__df_concat_with_tf
    Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf AS l, __splink__df_concat_with_tf AS r ] and has output table name: __splink__blocked_id_pairs
Executing sql to create __splink__blocked_id_pairs as physical name __splink__blocked_id_pairs_df367c62d

------Start SQL---------
CREATE TABLE __splink__blocked_id_pairs_df367c62d AS 
WITH

__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_4b5d5ed6b) 

            select
            '0' as match_key,
            l."addid" as join_key_l,
            r."addid" as join_key_r
            from __splink__df_concat_with_tf as l
            inner join __splink__df_concat_with_tf as r
            on
            (l."addid" = r."addid")
             where 1=1 
            
            
-------End SQL-----------

Setting cache for __splink__blocked_id_pairs_df367c62d with physical name __splink__blocked_id_pairs_df367c62d
SQL pipeline was passed inputs [__splink__blocked_id_pairs_df367c62d, __splink__df_concat_with_tf_4b5d5ed6b] and output dataset __splink__df_self_link
    Pipeline part 1: CTE reads tables [ __splink__blocked_id_pairs_df367c62d ] and has output table name: __splink__blocked_id_pairs
    Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf_4b5d5ed6b ] and has output table name: __splink__df_concat_with_tf
    Pipeline part 3: CTE reads tables [ __splink__df_concat_with_tf AS l, __splink__blocked_id_pairs AS b, __splink__df_concat_with_tf AS r ] and has output table name: blocked_with_cols
    Pipeline part 4: CTE reads tables [ blocked_with_cols ] and has output table name: __splink__df_comparison_vectors
    Pipeline part 5: CTE reads tables [ __splink__df_comparison_vectors ] and has output table name: __splink__df_match_weight_parts
    Pipeline part 6: CTE reads tables [ __splink__df_match_weight_parts ] and has output table name: __splink__df_self_link
Executing sql to create __splink__df_self_link as physical name __splink__df_self_link_033da4740

------Start SQL---------
CREATE TABLE __splink__df_self_link_033da4740 AS 
WITH

__splink__blocked_id_pairs as (
select * from __splink__blocked_id_pairs_df367c62d), 

__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_4b5d5ed6b), 

blocked_with_cols as (
    select "l"."addid" AS "addid_l", 
"r"."addid" AS "addid_r", 
"l"."accuracy" AS "accuracy_l", 
"r"."accuracy" AS "accuracy_r", 
"l"."saddstr" AS "saddstr_l", 
"r"."saddstr" AS "saddstr_r", 
"l"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_l", 
"r"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_r", 
"l"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_l", 
"r"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_r", 
"l"."scity" AS "scity_l", 
"r"."scity" AS "scity_r", 
"l"."tf_scity" AS "tf_scity_l", 
"r"."tf_scity" AS "tf_scity_r", 
"l"."szip5" AS "szip5_l", 
"r"."szip5" AS "szip5_r", 
"l"."tf_szip5" AS "tf_szip5_l", 
"r"."tf_szip5" AS "tf_szip5_r", b.match_key
    from __splink__df_concat_with_tf as l
    inner join __splink__blocked_id_pairs as b
    on l."addid" = b.join_key_l
    inner join __splink__df_concat_with_tf as r
    on r."addid" = b.join_key_r
    ), 

__splink__df_comparison_vectors as (
    select "addid_l", 
"addid_r", 
CASE WHEN (saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL) THEN -1 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 10 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 9 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 8 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 7 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 6 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 5 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 4 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 3 WHEN array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 2 WHEN array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 1 ELSE 0 END as gamma_street, 
CASE WHEN "scity_l" IS NULL OR "scity_r" IS NULL THEN -1 WHEN "scity_l" = "scity_r" THEN 1 ELSE 0 END as gamma_scity, 
"tf_scity_l", 
"tf_scity_r", 
CASE WHEN szip5_l IS NULL OR szip5_r IS NULL THEN -1 WHEN szip5_l = szip5_r THEN 4 WHEN substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4) THEN 3 WHEN substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3) THEN 2 WHEN damerau_levenshtein(szip5_l, szip5_r) <= 2 THEN 1 ELSE 0 END as gamma_szip5, 
"tf_szip5_l", 
"tf_szip5_r", 
match_key 
    from blocked_with_cols
    ), 

__splink__df_match_weight_parts as (
    select "addid_l","addid_r",gamma_street,CASE 
WHEN
gamma_street = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_street = 10
THEN cast(49678.52911992316 as float8)
 
WHEN
gamma_street = 9
THEN cast(21536.078003967268 as float8)
 
WHEN
gamma_street = 8
THEN cast(31479.454181137276 as float8)
 
WHEN
gamma_street = 7
THEN cast(42510.434297932894 as float8)
 
WHEN
gamma_street = 6
THEN cast(2223.603135817533 as float8)
 
WHEN
gamma_street = 5
THEN cast(230.75868436173087 as float8)
 
WHEN
gamma_street = 4
THEN cast(113.05510082601522 as float8)
 
WHEN
gamma_street = 3
THEN cast(85.75861558013442 as float8)
 
WHEN
gamma_street = 2
THEN cast(14.205441987736027 as float8)
 
WHEN
gamma_street = 1
THEN cast(65.00410034352372 as float8)
 
WHEN
gamma_street = 0
THEN cast(0.0006700287948115862 as float8)
 END as bf_street ,gamma_scity,CASE 
WHEN
gamma_scity = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_scity = 1
THEN cast(20.338205372111965 as float8)
 
WHEN
gamma_scity = 0
THEN cast(0.04420372330105158 as float8)
 END as bf_scity ,CASE WHEN  gamma_scity = -1 then cast(1 as float8) WHEN  gamma_scity = 1 then
    (CASE WHEN coalesce("tf_scity_l", "tf_scity_r") is not null
    THEN
    POW(
        cast(0.04709747703972181 as float8) /
    (CASE
        WHEN coalesce("tf_scity_l", "tf_scity_r") >= coalesce("tf_scity_r", "tf_scity_l")
        THEN coalesce("tf_scity_l", "tf_scity_r")
        ELSE coalesce("tf_scity_r", "tf_scity_l")
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_scity = 0 then cast(1 as float8) END as bf_tf_adj_scity ,gamma_szip5,CASE 
WHEN
gamma_szip5 = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_szip5 = 4
THEN cast(245.60618690986593 as float8)
 
WHEN
gamma_szip5 = 3
THEN cast(0.3741913067691637 as float8)
 
WHEN
gamma_szip5 = 2
THEN cast(0.23762349361145213 as float8)
 
WHEN
gamma_szip5 = 1
THEN cast(2.973602866192421e-13 as float8)
 
WHEN
gamma_szip5 = 0
THEN cast(6.578648872187775e-15 as float8)
 END as bf_szip5 ,CASE WHEN  gamma_szip5 = -1 then cast(1 as float8) WHEN  gamma_szip5 = 4 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 3 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 2 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 1 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 0 then cast(1 as float8) END as bf_tf_adj_szip5  
    from __splink__df_comparison_vectors
    ) 

    select
    log2(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5) as match_weight,
    CASE WHEN bf_street = cast('infinity' as float8) OR bf_scity = cast('infinity' as float8) OR bf_tf_adj_scity = cast('infinity' as float8) OR bf_szip5 = cast('infinity' as float8) OR bf_tf_adj_szip5 = cast('infinity' as float8) THEN 1.0 ELSE (cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)/(1+(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)) END as match_probability,
    "addid_l","addid_r" 
    from __splink__df_match_weight_parts
    
    
-------End SQL-----------

Setting cache for __splink__df_self_link_033da4740 with physical name __splink__df_self_link_033da4740
SQL pipeline was passed inputs [] and output dataset __splink__df_unlinkables_proportions_cumulative
    Pipeline part 1: CTE reads tables [ __splink__df_self_link_033da4740 ] and has output table name: __splink__df_round_self_link
    Pipeline part 2: CTE reads tables [ __splink__df_round_self_link ] and has output table name: __splink__df_unlinkables_proportions
    Pipeline part 3: CTE reads tables [ __splink__df_unlinkables_proportions ] and has output table name: __splink__df_unlinkables_proportions_cumulative
Executing sql to create __splink__df_unlinkables_proportions_cumulative as physical name __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1

------Start SQL---------
CREATE TABLE __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1 AS 
WITH

__splink__df_round_self_link as (
        select
        round(match_weight, 2) as match_weight,
        round(match_probability, 5) as match_probability
        from __splink__df_self_link_033da4740
    ), 

__splink__df_unlinkables_proportions as (
        select
        max(match_weight) as match_weight,
        match_probability,
        count(*) / cast( sum(count(*)) over () as float) as prop
        from __splink__df_round_self_link
        group by match_probability
        order by match_probability
    ) 

        select *,
        sum(prop) over(order by match_probability) as cum_prop
        from __splink__df_unlinkables_proportions
        where match_probability < 1
    
-------End SQL-----------

Setting cache for __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1 with physical name __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1
Dropping table with templated name __splink__df_unlinkables_proportions_cumulative and physical name __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1

When I increase the number of rows to 1 million, the process begins to drag on and seems to get stuck about here:

Here is the logging of that unsuccessful SQL query

SQL pipeline was passed inputs [] and output dataset __splink__df_concat_with_tf
    Pipeline part 1: CTE reads tables [ regrid_sample ] and has output table name: __splink__df_concat
    Pipeline part 2: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_szip5
    Pipeline part 3: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_scity
    Pipeline part 4: CTE reads tables [ __splink__df_tf_scity, __splink__df_tf_szip5, __splink__df_concat ] and has output table name: __splink__df_concat_with_tf
Executing sql to create __splink__df_concat_with_tf as physical name __splink__df_concat_with_tf_41f206f06

------Start SQL---------
CREATE TABLE __splink__df_concat_with_tf_41f206f06 AS 
WITH

__splink__df_concat as (
            select "addid", "saddno", "saddpref", "saddstr", "saddsttyp", "saddstsuf", "sunit", "scity", "szip5", "accuracy", "geohash_self_and_neighbors8", "geohash_self_and_neighbors7", "source"
            , random() as __splink_salt
            from regrid_sample
            ), 

__splink__df_tf_szip5 as (
    select
    "szip5", cast(count(*) as float8) / (select
        count("szip5") as total from __splink__df_concat)
            as "tf_szip5"
    from __splink__df_concat
    where "szip5" is not null
    group by "szip5"
    ), 

__splink__df_tf_scity as (
    select
    "scity", cast(count(*) as float8) / (select
        count("scity") as total from __splink__df_concat)
            as "tf_scity"
    from __splink__df_concat
    where "scity" is not null
    group by "scity"
    ) 

    select __splink__df_concat.*, __splink__df_tf_szip5."tf_szip5", __splink__df_tf_scity."tf_scity"
    from __splink__df_concat
    left join __splink__df_tf_szip5 on __splink__df_concat."szip5" = __splink__df_tf_szip5."szip5" left join __splink__df_tf_scity on __splink__df_concat."scity" = __splink__df_tf_scity."scity"
    
-------End SQL-----------

Setting cache for __splink__df_concat_with_tf_41f206f06 with physical name __splink__df_concat_with_tf_41f206f06
Setting cache for __splink__df_concat_with_tf with physical name __splink__df_concat_with_tf_41f206f06
SQL pipeline was passed inputs [__splink__df_concat_with_tf_41f206f06] and output dataset __splink__blocked_id_pairs
    Pipeline part 1: CTE reads tables [ __splink__df_concat_with_tf_41f206f06 ] and has output table name: __splink__df_concat_with_tf
    Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf AS r, __splink__df_concat_with_tf AS l ] and has output table name: __splink__blocked_id_pairs
Executing sql to create __splink__blocked_id_pairs as physical name __splink__blocked_id_pairs_0d1339724

------Start SQL---------
CREATE TABLE __splink__blocked_id_pairs_0d1339724 AS 
WITH

__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_41f206f06) 

            select
            '0' as match_key,
            l."addid" as join_key_l,
            r."addid" as join_key_r
            from __splink__df_concat_with_tf as l
            inner join __splink__df_concat_with_tf as r
            on
            (l."addid" = r."addid")
             where 1=1 
            
            
-------End SQL-----------

Setting cache for __splink__blocked_id_pairs_0d1339724 with physical name __splink__blocked_id_pairs_0d1339724
SQL pipeline was passed inputs [__splink__blocked_id_pairs_0d1339724, __splink__df_concat_with_tf_41f206f06] and output dataset __splink__df_self_link
    Pipeline part 1: CTE reads tables [ __splink__blocked_id_pairs_0d1339724 ] and has output table name: __splink__blocked_id_pairs
    Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf_41f206f06 ] and has output table name: __splink__df_concat_with_tf
    Pipeline part 3: CTE reads tables [ __splink__df_concat_with_tf AS r, __splink__blocked_id_pairs AS b, __splink__df_concat_with_tf AS l ] and has output table name: blocked_with_cols
    Pipeline part 4: CTE reads tables [ blocked_with_cols ] and has output table name: __splink__df_comparison_vectors
    Pipeline part 5: CTE reads tables [ __splink__df_comparison_vectors ] and has output table name: __splink__df_match_weight_parts
    Pipeline part 6: CTE reads tables [ __splink__df_match_weight_parts ] and has output table name: __splink__df_self_link
Executing sql to create __splink__df_self_link as physical name __splink__df_self_link_255e31a79

------Start SQL---------
CREATE TABLE __splink__df_self_link_255e31a79 AS 
WITH

__splink__blocked_id_pairs as (
select * from __splink__blocked_id_pairs_0d1339724), 

__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_41f206f06), 

blocked_with_cols as (
    select "l"."addid" AS "addid_l", 
"r"."addid" AS "addid_r", 
"l"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_l", 
"r"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_r", 
"l"."accuracy" AS "accuracy_l", 
"r"."accuracy" AS "accuracy_r", 
"l"."saddstr" AS "saddstr_l", 
"r"."saddstr" AS "saddstr_r", 
"l"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_l", 
"r"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_r", 
"l"."scity" AS "scity_l", 
"r"."scity" AS "scity_r", 
"l"."tf_scity" AS "tf_scity_l", 
"r"."tf_scity" AS "tf_scity_r", 
"l"."szip5" AS "szip5_l", 
"r"."szip5" AS "szip5_r", 
"l"."tf_szip5" AS "tf_szip5_l", 
"r"."tf_szip5" AS "tf_szip5_r", b.match_key
    from __splink__df_concat_with_tf as l
    inner join __splink__blocked_id_pairs as b
    on l."addid" = b.join_key_l
    inner join __splink__df_concat_with_tf as r
    on r."addid" = b.join_key_r
    ), 

__splink__df_comparison_vectors as (
    select "addid_l", 
"addid_r", 
CASE WHEN (saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL) THEN -1 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 10 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 9 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 8 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 7 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 6 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 5 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 4 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 3 WHEN array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 2 WHEN array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 1 ELSE 0 END as gamma_street, 
CASE WHEN "scity_l" IS NULL OR "scity_r" IS NULL THEN -1 WHEN "scity_l" = "scity_r" THEN 1 ELSE 0 END as gamma_scity, 
"tf_scity_l", 
"tf_scity_r", 
CASE WHEN szip5_l IS NULL OR szip5_r IS NULL THEN -1 WHEN szip5_l = szip5_r THEN 4 WHEN substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4) THEN 3 WHEN substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3) THEN 2 WHEN damerau_levenshtein(szip5_l, szip5_r) <= 2 THEN 1 ELSE 0 END as gamma_szip5, 
"tf_szip5_l", 
"tf_szip5_r", 
match_key 
    from blocked_with_cols
    ), 

__splink__df_match_weight_parts as (
    select "addid_l","addid_r",gamma_street,CASE 
WHEN
gamma_street = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_street = 10
THEN cast(49678.52911992316 as float8)
 
WHEN
gamma_street = 9
THEN cast(21536.078003967268 as float8)
 
WHEN
gamma_street = 8
THEN cast(31479.454181137276 as float8)
 
WHEN
gamma_street = 7
THEN cast(42510.434297932894 as float8)
 
WHEN
gamma_street = 6
THEN cast(2223.603135817533 as float8)
 
WHEN
gamma_street = 5
THEN cast(230.75868436173087 as float8)
 
WHEN
gamma_street = 4
THEN cast(113.05510082601522 as float8)
 
WHEN
gamma_street = 3
THEN cast(85.75861558013442 as float8)
 
WHEN
gamma_street = 2
THEN cast(14.205441987736027 as float8)
 
WHEN
gamma_street = 1
THEN cast(65.00410034352372 as float8)
 
WHEN
gamma_street = 0
THEN cast(0.0006700287948115862 as float8)
 END as bf_street ,gamma_scity,CASE 
WHEN
gamma_scity = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_scity = 1
THEN cast(20.338205372111965 as float8)
 
WHEN
gamma_scity = 0
THEN cast(0.04420372330105158 as float8)
 END as bf_scity ,CASE WHEN  gamma_scity = -1 then cast(1 as float8) WHEN  gamma_scity = 1 then
    (CASE WHEN coalesce("tf_scity_l", "tf_scity_r") is not null
    THEN
    POW(
        cast(0.04709747703972181 as float8) /
    (CASE
        WHEN coalesce("tf_scity_l", "tf_scity_r") >= coalesce("tf_scity_r", "tf_scity_l")
        THEN coalesce("tf_scity_l", "tf_scity_r")
        ELSE coalesce("tf_scity_r", "tf_scity_l")
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_scity = 0 then cast(1 as float8) END as bf_tf_adj_scity ,gamma_szip5,CASE 
WHEN
gamma_szip5 = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_szip5 = 4
THEN cast(245.60618690986593 as float8)
 
WHEN
gamma_szip5 = 3
THEN cast(0.3741913067691637 as float8)
 
WHEN
gamma_szip5 = 2
THEN cast(0.23762349361145213 as float8)
 
WHEN
gamma_szip5 = 1
THEN cast(2.973602866192421e-13 as float8)
 
WHEN
gamma_szip5 = 0
THEN cast(6.578648872187775e-15 as float8)
 END as bf_szip5 ,CASE WHEN  gamma_szip5 = -1 then cast(1 as float8) WHEN  gamma_szip5 = 4 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 3 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 2 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 1 then
    (CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
    THEN
    POW(
        cast(0.003993214383316761 as float8) /
    (CASE
        WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
        AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
            THEN coalesce("tf_szip5_l", "tf_szip5_r")
        WHEN coalesce("tf_szip5_r", "tf_szip5_l")  > cast(0.001 as float8)
            THEN coalesce("tf_szip5_r", "tf_szip5_l")
        ELSE cast(0.001 as float8)
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_szip5 = 0 then cast(1 as float8) END as bf_tf_adj_szip5  
    from __splink__df_comparison_vectors
    ) 

    select
    log2(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5) as match_weight,
    CASE WHEN bf_street = cast('infinity' as float8) OR bf_scity = cast('infinity' as float8) OR bf_tf_adj_scity = cast('infinity' as float8) OR bf_szip5 = cast('infinity' as float8) OR bf_tf_adj_szip5 = cast('infinity' as float8) THEN 1.0 ELSE (cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)/(1+(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)) END as match_probability,
    "addid_l","addid_r" 
    from __splink__df_match_weight_parts
    
    
-------End SQL-----------

Thanks again for the help!

P.S. Unfortunately I will miss your presentation at Princeton this week but I am encouraging all of my colleagues to attend.

RobinL Oct 14, 2024
Maintainer

There's nothing that jumps out at me from the SQL. I have two hypotheses:

We've had problems with things getting stuck on duckdb 1.0.0 (not the unlinkables chart, but some other QA charts). upgrading to 1.1.1 has fixed it for us.
It's possible the chart is not optimised for large amounts of data, you could try returning it as a dict by setting as_dict to true to test this hypothesis:

splink/splink/internals/linker_components/evaluation.py

Line 338 in 83df823

def unlinkables_chart(

I could be neither, but both are worth a try if only to discount

AdamChapnik1 Oct 14, 2024
Author

I'm thinking the issue might be either a different problem with duckdb or an issue with the computation. I updated to duckdb 1.1.2 without any new luck, and I'm not seeing any progress with using as_dict = True. Maybe I'm expecting this to be much faster than it can be and should throw even more computational power at it as I did for the training and prediction, and perhaps give it longer than four hours? What's odd is that I've had success with comparably large datasets, which produced the same chart in seconds, and there isn't anything obviously different about this one nor the model. If I change the settings to dedupe_only, I can successfully create the unlinkables chart for other smaller tables I had linked with this same model, so I do think the issue is the computation.

RobinL Oct 22, 2024
Maintainer

I've had another look. The following script creates 1 million fake records according to your schema and then produces the unlinkables chart:

Click to expand

import pandas as pd
import time 
from splink import DuckDBAPI, Linker

db_api = DuckDBAPI()


data = [
    {
        "saddno": "123",
        "saddstr": "Main St",
        "scity": "New York",
        "szip5": "10001",
        "geohash_self_and_neighbors7": ["dr5r7p6", "dr5r7p6"],
        "accuracy": 0.98,
        "geohash_self_and_neighbors8": ["dr5r7p63", "dr5r7p63"],
    },
    {
        "saddno": "456",
        "saddstr": "Broadway",
        "scity": "Los Angeles",
        "szip5": "90001",
        "geohash_self_and_neighbors7": ["9q5cgvc"],
        "accuracy": 0.95,
        "geohash_self_and_neighbors8": ["9q5cgvcc", "9q5cgvcc"],
    },
    {
        "saddno": "789",
        "saddstr": "Elm St",
        "scity": "Chicago",
        "szip5": "60601",
        "geohash_self_and_neighbors7": ["dp3wj1j", "dp3wj1j"],
        "accuracy": 0.92,
        "geohash_self_and_neighbors8": ["dp3wj1j8", "dp3wj1j8"],
    },
]

df = pd.DataFrame(data)

# Create 1 million rows by repeating the existing rows
df = pd.concat([df] * (10**6 // len(df)), ignore_index=True)


# Reset the index and use it as the 'addid' column
df = df.reset_index(drop=True)
df["addid"] = df.index

df = df.copy()

settings_dict = {
    "link_type": "dedupe_only",
    "probability_two_random_records_match": 5.580857250187588e-07,
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
    "additional_columns_to_retain": ["addid"],
    "sql_dialect": "duckdb",
    "linker_uid": "zw7adf0n",
    "em_convergence": 0.0001,
    "max_iterations": 25,
    "bayes_factor_column_prefix": "bf_",
    "term_frequency_adjustment_column_prefix": "tf_",
    "comparison_vector_value_column_prefix": "gamma_",
    "unique_id_column_name": "addid",
    "source_dataset_column_name": "source_dataset",
    "blocking_rules_to_generate_predictions": [
        {
            "blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity AND l.szip5 = r.szip5",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.szip5 = r.szip5",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND substr(l.szip5,1,4) = substr(r.szip5,1,4)",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity AND l.szip5 = r.szip5",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.szip5 = r.szip5",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND l.scity = r.scity",
            "sql_dialect": "duckdb",
        },
        {
            "blocking_rule": "l.saddno = r.saddno AND l.szip5 = r.szip5",
            "sql_dialect": "duckdb",
        },
    ],
    "comparisons": [
        {
            "output_column_name": "street",
            "comparison_levels": [
                {
                    "sql_condition": "(saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL)",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
                    "label_for_charts": "Exact match (saddstr) and Intersect (geohash 7)",
                    "m_probability": 0.5184678962113823,
                    "u_probability": 1.0436458272744132e-05,
                },
                {
                    "sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
                    "label_for_charts": "Jaro-Winkler (saddstr) >= 0.92 and Intersect (geohash 7)",
                    "m_probability": 0.011895738173284133,
                    "u_probability": 5.523632562573723e-07,
                },
                {
                    "sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
                    "label_for_charts": "Jaro-Winkler (saddstr) >= 0.88 and Intersect (geohash 7)",
                    "m_probability": 0.019878517576250577,
                    "u_probability": 6.314759290890734e-07,
                },
                {
                    "sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
                    "label_for_charts": "Jaro-Winkler (saddstr) >= 0.84 and Intersect (geohash 7)",
                    "m_probability": 0.018633469905727222,
                    "u_probability": 4.3832697109455996e-07,
                },
                {
                    "sql_condition": "(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
                    "label_for_charts": "Exact match (saddstr)",
                    "m_probability": 0.3320169362895465,
                    "u_probability": 0.0001493148354314929,
                },
                {
                    "sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
                    "label_for_charts": "Jaro-Winkler (saddstr) >= 0.92",
                    "m_probability": 0.019878741014857884,
                    "u_probability": 8.614514799233525e-05,
                },
                {
                    "sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
                    "label_for_charts": "Jaro-Winkler (saddstr) >= 0.88",
                    "m_probability": 0.03469364631438866,
                    "u_probability": 0.0003068737815534748,
                },
                {
                    "sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
                    "label_for_charts": "Jaro-Winkler (saddstr) >= 0.84",
                    "m_probability": 0.04372303645073314,
                    "u_probability": 0.0005098384127933774,
                },
                {
                    "sql_condition": "array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)",
                    "label_for_charts": "Intersect (geohash 8)",
                    "m_probability": 1.7211818229633315e-06,
                    "u_probability": 1.211635529854881e-07,
                },
                {
                    "sql_condition": "array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)",
                    "label_for_charts": "Intersect (geohash 7)",
                    "m_probability": 0.00014098268683681274,
                    "u_probability": 2.168827598440237e-06,
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.0006693141951697755,
                    "u_probability": 0.9989334792066487,
                },
            ],
            "comparison_description": "CustomComparison",
        },
        {
            "output_column_name": "scity",
            "comparison_levels": [
                {
                    "sql_condition": '"scity_l" IS NULL OR "scity_r" IS NULL',
                    "label_for_charts": "scity is NULL",
                    "is_null_level": True,
                },
                {
                    "sql_condition": '"scity_l" = "scity_r"',
                    "label_for_charts": "Exact match on scity",
                    "m_probability": 0.95787816054219,
                    "u_probability": 0.04709747703972181,
                    "tf_adjustment_column": "scity",
                    "tf_adjustment_weight": 1.0,
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.04212183945781009,
                    "u_probability": 0.9529025229602782,
                },
            ],
            "comparison_description": "ExactMatch",
        },
        {
            "output_column_name": "szip5",
            "comparison_levels": [
                {
                    "sql_condition": "szip5_l IS NULL OR szip5_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "szip5_l = szip5_r",
                    "label_for_charts": "Exact match",
                    "m_probability": 0.9807581582000615,
                    "u_probability": 0.003993214383316761,
                    "tf_adjustment_column": "szip5",
                    "tf_minimum_u_value": 0.001,
                    "tf_adjustment_weight": 1.0,
                },
                {
                    "sql_condition": "substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4)",
                    "label_for_charts": "Exact match (zip4)",
                    "m_probability": 0.005447918971943777,
                    "u_probability": 0.01455918102155314,
                    "tf_adjustment_column": "szip5",
                    "tf_minimum_u_value": 0.001,
                    "tf_adjustment_weight": 0.5,
                },
                {
                    "sql_condition": "substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3)",
                    "label_for_charts": "Exact match (zip3)",
                    "m_probability": 0.013793922827954863,
                    "u_probability": 0.058049490891291536,
                    "tf_adjustment_column": "szip5",
                    "tf_minimum_u_value": 0.001,
                    "tf_adjustment_weight": 0.5,
                },
                {
                    "sql_condition": "damerau_levenshtein(szip5_l, szip5_r) <= 2",
                    "label_for_charts": "Damerau-Levenshtein <= 2",
                    "m_probability": 3.460899719747275e-14,
                    "u_probability": 0.11638742210989385,
                    "tf_adjustment_column": "szip5",
                    "tf_minimum_u_value": 0.001,
                    "tf_adjustment_weight": 0.5,
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 5.30903997609798e-15,
                    "u_probability": 0.8070106915939447,
                },
            ],
            "comparison_description": "CustomComparison",
        },
    ],
}

linker = Linker(df, settings_dict, db_api)

start_time = time.time()
linker.evaluation.unlinkables_chart()
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")

It executes successfully on my laptop, with the Splink part taking 6 seconds

In terms of data size - that shouldn't be causing an issue here. The unlinkables chart produces a self link, so with a million reocrds it's only having to evaluate a million comparisons.

All I can't think of is there's something in your dataset that's causing a problem with duckdb

AdamChapnik1 Nov 8, 2024
Author

Thanks for taking a deeper look at this. I'll have to come back to it later—I've been sidetracked with other things. I'm a bit at a loss how to reproduce the error (if there is one) but I will follow up if I can figure out the cause.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't produce any evaluation charts from saved model #2449

{{title}}

Replies: 1 comment 5 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Can't produce any evaluation charts from saved model #2449

AdamChapnik1 Oct 4, 2024

Replies: 1 comment · 5 replies

RobinL Oct 7, 2024 Maintainer

AdamChapnik1 Oct 14, 2024 Author

RobinL Oct 14, 2024 Maintainer

AdamChapnik1 Oct 14, 2024 Author

RobinL Oct 22, 2024 Maintainer

AdamChapnik1 Nov 8, 2024 Author

AdamChapnik1
Oct 4, 2024

Replies: 1 comment 5 replies

RobinL
Oct 7, 2024
Maintainer

AdamChapnik1 Oct 14, 2024
Author

RobinL Oct 14, 2024
Maintainer

AdamChapnik1 Oct 14, 2024
Author

RobinL Oct 22, 2024
Maintainer

AdamChapnik1 Nov 8, 2024
Author