Can't produce any evaluation charts from saved model #2449
-
Not sure. The unlinkables chart is the only one that does any calculations against the backend database. To debug, the first thing I'd probably do is see whether you still get the error on smaller datasets, e.g. if you do
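A minimal sketch of that idea, using pandas and a hypothetical input frame (in practice you'd slice your real dataset before passing it to the linker):

```python
import pandas as pd

# Hypothetical stand-in for the real input table
df = pd.DataFrame({"addid": range(1000), "scity": ["x"] * 1000})

# Re-run the pipeline on a small slice to see whether the error persists
df_small = df.head(100)
```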
The other thing I'd look at is turning on logging at a lower level to see the SQL that's causing the problem.
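In Python that might look like the following sketch; note the exact logger name and the level at which Splink emits its generated SQL may vary between versions, so treat this as a starting point:

```python
import logging

logging.basicConfig(format="%(message)s")
# Splink logs the SQL it generates at verbose levels;
# DEBUG (or lower custom levels) is a reasonable place to start
logging.getLogger("splink").setLevel(logging.DEBUG)
```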
Finally, you can probably post your settings here. They will reveal the names of the columns you're using and how you've set the model up, but the settings.json doesn't contain any actual data from your input datasets.
-
I've tried to debug what I can, but I'm not making much progress, so in case you can make more sense of this, I'm dropping all of the information I have here. I would really appreciate any insights. Here are the settings:
{'link_type': 'dedupe_only',
'probability_two_random_records_match': 5.580857250187588e-07,
'retain_matching_columns': False,
'retain_intermediate_calculation_columns': False,
'additional_columns_to_retain': ['addid'],
'sql_dialect': 'duckdb',
'linker_uid': 'zw7adf0n',
'em_convergence': 0.0001,
'max_iterations': 25,
'bayes_factor_column_prefix': 'bf_',
'term_frequency_adjustment_column_prefix': 'tf_',
'comparison_vector_value_column_prefix': 'gamma_',
'unique_id_column_name': 'addid',
'source_dataset_column_name': 'source_dataset',
'blocking_rules_to_generate_predictions': [{'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity AND l.szip5 = r.szip5',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.szip5 = r.szip5',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr AND substr(l.szip5,1,4) = substr(r.szip5,1,4)',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity AND l.szip5 = r.szip5',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.szip5 = r.szip5',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND l.saddstr = r.saddstr',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND l.scity = r.scity',
'sql_dialect': 'duckdb'},
{'blocking_rule': 'l.saddno = r.saddno AND l.szip5 = r.szip5',
'sql_dialect': 'duckdb'}],
'comparisons': [{'output_column_name': 'street',
'comparison_levels': [{'sql_condition': '(saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL)',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': '(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
'label_for_charts': 'Exact match (saddstr) and Intersect (geohash 7)',
'm_probability': 0.5184678962113823,
'u_probability': 1.0436458272744132e-05},
{'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.92 and Intersect (geohash 7)',
'm_probability': 0.011895738173284133,
'u_probability': 5.523632562573723e-07},
{'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.88 and Intersect (geohash 7)',
'm_probability': 0.019878517576250577,
'u_probability': 6.314759290890734e-07},
{'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))',
'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.84 and Intersect (geohash 7)',
'm_probability': 0.018633469905727222,
'u_probability': 4.3832697109455996e-07},
{'sql_condition': '(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
'label_for_charts': 'Exact match (saddstr)',
'm_probability': 0.3320169362895465,
'u_probability': 0.0001493148354314929},
{'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.92',
'm_probability': 0.019878741014857884,
'u_probability': 8.614514799233525e-05},
{'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.88',
'm_probability': 0.03469364631438866,
'u_probability': 0.0003068737815534748},
{'sql_condition': '(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))',
'label_for_charts': 'Jaro-Winkler (saddstr) >= 0.84',
'm_probability': 0.04372303645073314,
'u_probability': 0.0005098384127933774},
{'sql_condition': 'array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)',
'label_for_charts': 'Intersect (geohash 8)',
'm_probability': 1.7211818229633315e-06,
'u_probability': 1.211635529854881e-07},
{'sql_condition': 'array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)',
'label_for_charts': 'Intersect (geohash 7)',
'm_probability': 0.00014098268683681274,
'u_probability': 2.168827598440237e-06},
{'sql_condition': 'ELSE',
'label_for_charts': 'All other comparisons',
'm_probability': 0.0006693141951697755,
'u_probability': 0.9989334792066487}],
'comparison_description': 'CustomComparison'},
{'output_column_name': 'scity',
'comparison_levels': [{'sql_condition': '"scity_l" IS NULL OR "scity_r" IS NULL',
'label_for_charts': 'scity is NULL',
'is_null_level': True},
{'sql_condition': '"scity_l" = "scity_r"',
'label_for_charts': 'Exact match on scity',
'm_probability': 0.95787816054219,
'u_probability': 0.04709747703972181,
'tf_adjustment_column': 'scity',
'tf_adjustment_weight': 1.0},
{'sql_condition': 'ELSE',
'label_for_charts': 'All other comparisons',
'm_probability': 0.04212183945781009,
'u_probability': 0.9529025229602782}],
'comparison_description': 'ExactMatch'},
{'output_column_name': 'szip5',
'comparison_levels': [{'sql_condition': 'szip5_l IS NULL OR szip5_r IS NULL',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': 'szip5_l = szip5_r',
'label_for_charts': 'Exact match',
'm_probability': 0.9807581582000615,
'u_probability': 0.003993214383316761,
'tf_adjustment_column': 'szip5',
'tf_minimum_u_value': 0.001,
'tf_adjustment_weight': 1.0},
{'sql_condition': 'substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4)',
'label_for_charts': 'Exact match (zip4)',
'm_probability': 0.005447918971943777,
'u_probability': 0.01455918102155314,
'tf_adjustment_column': 'szip5',
'tf_minimum_u_value': 0.001,
'tf_adjustment_weight': 0.5},
{'sql_condition': 'substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3)',
'label_for_charts': 'Exact match (zip3)',
'm_probability': 0.013793922827954863,
'u_probability': 0.058049490891291536,
'tf_adjustment_column': 'szip5',
'tf_minimum_u_value': 0.001,
'tf_adjustment_weight': 0.5},
{'sql_condition': 'damerau_levenshtein(szip5_l, szip5_r) <= 2',
'label_for_charts': 'Damerau-Levenshtein <= 2',
'm_probability': 3.460899719747275e-14,
'u_probability': 0.11638742210989385,
'tf_adjustment_column': 'szip5',
'tf_minimum_u_value': 0.001,
'tf_adjustment_weight': 0.5},
{'sql_condition': 'ELSE',
'label_for_charts': 'All other comparisons',
'm_probability': 5.30903997609798e-15,
'u_probability': 0.8070106915939447}],
'comparison_description': 'CustomComparison'}]}
I am able to reduce one of the tables (source dataset 'r') to the first 100 rows and adjust the settings accordingly, and the query succeeds. Here is the logging of that successful SQL query:
SQL pipeline was passed inputs [] and output dataset __splink__df_concat_with_tf
Pipeline part 1: CTE reads tables [ regrid_sample ] and has output table name: __splink__df_concat
Pipeline part 2: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_szip5
Pipeline part 3: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_scity
Pipeline part 4: CTE reads tables [ __splink__df_tf_scity, __splink__df_tf_szip5, __splink__df_concat ] and has output table name: __splink__df_concat_with_tf
Executing sql to create __splink__df_concat_with_tf as physical name __splink__df_concat_with_tf_4b5d5ed6b
------Start SQL---------
CREATE TABLE __splink__df_concat_with_tf_4b5d5ed6b AS
WITH
__splink__df_concat as (
select "addid", "saddno", "saddpref", "saddstr", "saddsttyp", "saddstsuf", "sunit", "scity", "szip5", "accuracy", "geohash_self_and_neighbors8", "geohash_self_and_neighbors7", "source"
, random() as __splink_salt
from regrid_sample
),
__splink__df_tf_szip5 as (
select
"szip5", cast(count(*) as float8) / (select
count("szip5") as total from __splink__df_concat)
as "tf_szip5"
from __splink__df_concat
where "szip5" is not null
group by "szip5"
),
__splink__df_tf_scity as (
select
"scity", cast(count(*) as float8) / (select
count("scity") as total from __splink__df_concat)
as "tf_scity"
from __splink__df_concat
where "scity" is not null
group by "scity"
)
select __splink__df_concat.*, __splink__df_tf_szip5."tf_szip5", __splink__df_tf_scity."tf_scity"
from __splink__df_concat
left join __splink__df_tf_szip5 on __splink__df_concat."szip5" = __splink__df_tf_szip5."szip5" left join __splink__df_tf_scity on __splink__df_concat."scity" = __splink__df_tf_scity."scity"
-------End SQL-----------
Setting cache for __splink__df_concat_with_tf_4b5d5ed6b with physical name __splink__df_concat_with_tf_4b5d5ed6b
Setting cache for __splink__df_concat_with_tf with physical name __splink__df_concat_with_tf_4b5d5ed6b
SQL pipeline was passed inputs [__splink__df_concat_with_tf_4b5d5ed6b] and output dataset __splink__blocked_id_pairs
Pipeline part 1: CTE reads tables [ __splink__df_concat_with_tf_4b5d5ed6b ] and has output table name: __splink__df_concat_with_tf
Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf AS l, __splink__df_concat_with_tf AS r ] and has output table name: __splink__blocked_id_pairs
Executing sql to create __splink__blocked_id_pairs as physical name __splink__blocked_id_pairs_df367c62d
------Start SQL---------
CREATE TABLE __splink__blocked_id_pairs_df367c62d AS
WITH
__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_4b5d5ed6b)
select
'0' as match_key,
l."addid" as join_key_l,
r."addid" as join_key_r
from __splink__df_concat_with_tf as l
inner join __splink__df_concat_with_tf as r
on
(l."addid" = r."addid")
where 1=1
-------End SQL-----------
Setting cache for __splink__blocked_id_pairs_df367c62d with physical name __splink__blocked_id_pairs_df367c62d
SQL pipeline was passed inputs [__splink__blocked_id_pairs_df367c62d, __splink__df_concat_with_tf_4b5d5ed6b] and output dataset __splink__df_self_link
Pipeline part 1: CTE reads tables [ __splink__blocked_id_pairs_df367c62d ] and has output table name: __splink__blocked_id_pairs
Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf_4b5d5ed6b ] and has output table name: __splink__df_concat_with_tf
Pipeline part 3: CTE reads tables [ __splink__df_concat_with_tf AS l, __splink__blocked_id_pairs AS b, __splink__df_concat_with_tf AS r ] and has output table name: blocked_with_cols
Pipeline part 4: CTE reads tables [ blocked_with_cols ] and has output table name: __splink__df_comparison_vectors
Pipeline part 5: CTE reads tables [ __splink__df_comparison_vectors ] and has output table name: __splink__df_match_weight_parts
Pipeline part 6: CTE reads tables [ __splink__df_match_weight_parts ] and has output table name: __splink__df_self_link
Executing sql to create __splink__df_self_link as physical name __splink__df_self_link_033da4740
------Start SQL---------
CREATE TABLE __splink__df_self_link_033da4740 AS
WITH
__splink__blocked_id_pairs as (
select * from __splink__blocked_id_pairs_df367c62d),
__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_4b5d5ed6b),
blocked_with_cols as (
select "l"."addid" AS "addid_l",
"r"."addid" AS "addid_r",
"l"."accuracy" AS "accuracy_l",
"r"."accuracy" AS "accuracy_r",
"l"."saddstr" AS "saddstr_l",
"r"."saddstr" AS "saddstr_r",
"l"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_l",
"r"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_r",
"l"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_l",
"r"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_r",
"l"."scity" AS "scity_l",
"r"."scity" AS "scity_r",
"l"."tf_scity" AS "tf_scity_l",
"r"."tf_scity" AS "tf_scity_r",
"l"."szip5" AS "szip5_l",
"r"."szip5" AS "szip5_r",
"l"."tf_szip5" AS "tf_szip5_l",
"r"."tf_szip5" AS "tf_szip5_r", b.match_key
from __splink__df_concat_with_tf as l
inner join __splink__blocked_id_pairs as b
on l."addid" = b.join_key_l
inner join __splink__df_concat_with_tf as r
on r."addid" = b.join_key_r
),
__splink__df_comparison_vectors as (
select "addid_l",
"addid_r",
CASE WHEN (saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL) THEN -1 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 10 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 9 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 8 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 7 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 6 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 5 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 4 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 3 WHEN array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 2 WHEN 
array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 1 ELSE 0 END as gamma_street,
CASE WHEN "scity_l" IS NULL OR "scity_r" IS NULL THEN -1 WHEN "scity_l" = "scity_r" THEN 1 ELSE 0 END as gamma_scity,
"tf_scity_l",
"tf_scity_r",
CASE WHEN szip5_l IS NULL OR szip5_r IS NULL THEN -1 WHEN szip5_l = szip5_r THEN 4 WHEN substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4) THEN 3 WHEN substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3) THEN 2 WHEN damerau_levenshtein(szip5_l, szip5_r) <= 2 THEN 1 ELSE 0 END as gamma_szip5,
"tf_szip5_l",
"tf_szip5_r",
match_key
from blocked_with_cols
),
__splink__df_match_weight_parts as (
select "addid_l","addid_r",gamma_street,CASE
WHEN
gamma_street = -1
THEN cast(1.0 as float8)
WHEN
gamma_street = 10
THEN cast(49678.52911992316 as float8)
WHEN
gamma_street = 9
THEN cast(21536.078003967268 as float8)
WHEN
gamma_street = 8
THEN cast(31479.454181137276 as float8)
WHEN
gamma_street = 7
THEN cast(42510.434297932894 as float8)
WHEN
gamma_street = 6
THEN cast(2223.603135817533 as float8)
WHEN
gamma_street = 5
THEN cast(230.75868436173087 as float8)
WHEN
gamma_street = 4
THEN cast(113.05510082601522 as float8)
WHEN
gamma_street = 3
THEN cast(85.75861558013442 as float8)
WHEN
gamma_street = 2
THEN cast(14.205441987736027 as float8)
WHEN
gamma_street = 1
THEN cast(65.00410034352372 as float8)
WHEN
gamma_street = 0
THEN cast(0.0006700287948115862 as float8)
END as bf_street ,gamma_scity,CASE
WHEN
gamma_scity = -1
THEN cast(1.0 as float8)
WHEN
gamma_scity = 1
THEN cast(20.338205372111965 as float8)
WHEN
gamma_scity = 0
THEN cast(0.04420372330105158 as float8)
END as bf_scity ,CASE WHEN gamma_scity = -1 then cast(1 as float8) WHEN gamma_scity = 1 then
(CASE WHEN coalesce("tf_scity_l", "tf_scity_r") is not null
THEN
POW(
cast(0.04709747703972181 as float8) /
(CASE
WHEN coalesce("tf_scity_l", "tf_scity_r") >= coalesce("tf_scity_r", "tf_scity_l")
THEN coalesce("tf_scity_l", "tf_scity_r")
ELSE coalesce("tf_scity_r", "tf_scity_l")
END)
,
cast(1.0 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_scity = 0 then cast(1 as float8) END as bf_tf_adj_scity ,gamma_szip5,CASE
WHEN
gamma_szip5 = -1
THEN cast(1.0 as float8)
WHEN
gamma_szip5 = 4
THEN cast(245.60618690986593 as float8)
WHEN
gamma_szip5 = 3
THEN cast(0.3741913067691637 as float8)
WHEN
gamma_szip5 = 2
THEN cast(0.23762349361145213 as float8)
WHEN
gamma_szip5 = 1
THEN cast(2.973602866192421e-13 as float8)
WHEN
gamma_szip5 = 0
THEN cast(6.578648872187775e-15 as float8)
END as bf_szip5 ,CASE WHEN gamma_szip5 = -1 then cast(1 as float8) WHEN gamma_szip5 = 4 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(1.0 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 3 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(0.5 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 2 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(0.5 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 1 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(0.5 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 0 then cast(1 as float8) END as bf_tf_adj_szip5
from __splink__df_comparison_vectors
)
select
log2(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5) as match_weight,
CASE WHEN bf_street = cast('infinity' as float8) OR bf_scity = cast('infinity' as float8) OR bf_tf_adj_scity = cast('infinity' as float8) OR bf_szip5 = cast('infinity' as float8) OR bf_tf_adj_szip5 = cast('infinity' as float8) THEN 1.0 ELSE (cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)/(1+(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)) END as match_probability,
"addid_l","addid_r"
from __splink__df_match_weight_parts
-------End SQL-----------
Setting cache for __splink__df_self_link_033da4740 with physical name __splink__df_self_link_033da4740
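The final select in the query above multiplies a prior by each comparison's Bayes factor, then derives the match weight and probability. The arithmetic can be sketched with toy numbers (the Bayes factors below are hypothetical, not taken from this run):

```python
import math

# Prior from the SQL above; Bayes factors are toy values standing in for
# bf_street, bf_scity, bf_tf_adj_scity, bf_szip5, bf_tf_adj_szip5
prior = 5.58086036478609e-07
bfs = [2223.6, 20.3, 1.0, 245.6, 1.0]

odds = prior
for bf in bfs:
    odds *= bf

match_weight = math.log2(odds)            # log2 of combined odds
match_probability = odds / (1 + odds)     # odds -> probability
```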
SQL pipeline was passed inputs [] and output dataset __splink__df_unlinkables_proportions_cumulative
Pipeline part 1: CTE reads tables [ __splink__df_self_link_033da4740 ] and has output table name: __splink__df_round_self_link
Pipeline part 2: CTE reads tables [ __splink__df_round_self_link ] and has output table name: __splink__df_unlinkables_proportions
Pipeline part 3: CTE reads tables [ __splink__df_unlinkables_proportions ] and has output table name: __splink__df_unlinkables_proportions_cumulative
Executing sql to create __splink__df_unlinkables_proportions_cumulative as physical name __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1
------Start SQL---------
CREATE TABLE __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1 AS
WITH
__splink__df_round_self_link as (
select
round(match_weight, 2) as match_weight,
round(match_probability, 5) as match_probability
from __splink__df_self_link_033da4740
),
__splink__df_unlinkables_proportions as (
select
max(match_weight) as match_weight,
match_probability,
count(*) / cast( sum(count(*)) over () as float) as prop
from __splink__df_round_self_link
group by match_probability
order by match_probability
)
select *,
sum(prop) over(order by match_probability) as cum_prop
from __splink__df_unlinkables_proportions
where match_probability < 1
-------End SQL-----------
Setting cache for __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1 with physical name __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1
Dropping table with templated name __splink__df_unlinkables_proportions_cumulative and physical name __splink__df_unlinkables_proportions_cumulative_5ee4fc4f1
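The last query groups the self-link results by rounded match probability, computes each group's proportion of records, and takes a running cumulative proportion, filtering out probability 1 after the window function. The same logic can be sketched in pandas on toy data:

```python
import pandas as pd

# Toy self-link match probabilities (hypothetical values)
df = pd.DataFrame({"match_probability": [0.1, 0.1, 0.5, 0.99, 1.0]})

props = (
    df.groupby("match_probability").size().div(len(df)).rename("prop").reset_index()
)
props["cum_prop"] = props["prop"].cumsum()
# As in the SQL, the < 1 filter is applied after the window calculation
props = props[props["match_probability"] < 1]
```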
When I increase the number of rows to 1 million, the process begins to drag on and seems to get stuck partway through. Here is the logging of that unsuccessful SQL query:
SQL pipeline was passed inputs [] and output dataset __splink__df_concat_with_tf
Pipeline part 1: CTE reads tables [ regrid_sample ] and has output table name: __splink__df_concat
Pipeline part 2: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_szip5
Pipeline part 3: CTE reads tables [ __splink__df_concat ] and has output table name: __splink__df_tf_scity
Pipeline part 4: CTE reads tables [ __splink__df_tf_scity, __splink__df_tf_szip5, __splink__df_concat ] and has output table name: __splink__df_concat_with_tf
Executing sql to create __splink__df_concat_with_tf as physical name __splink__df_concat_with_tf_41f206f06
------Start SQL---------
CREATE TABLE __splink__df_concat_with_tf_41f206f06 AS
WITH
__splink__df_concat as (
select "addid", "saddno", "saddpref", "saddstr", "saddsttyp", "saddstsuf", "sunit", "scity", "szip5", "accuracy", "geohash_self_and_neighbors8", "geohash_self_and_neighbors7", "source"
, random() as __splink_salt
from regrid_sample
),
__splink__df_tf_szip5 as (
select
"szip5", cast(count(*) as float8) / (select
count("szip5") as total from __splink__df_concat)
as "tf_szip5"
from __splink__df_concat
where "szip5" is not null
group by "szip5"
),
__splink__df_tf_scity as (
select
"scity", cast(count(*) as float8) / (select
count("scity") as total from __splink__df_concat)
as "tf_scity"
from __splink__df_concat
where "scity" is not null
group by "scity"
)
select __splink__df_concat.*, __splink__df_tf_szip5."tf_szip5", __splink__df_tf_scity."tf_scity"
from __splink__df_concat
left join __splink__df_tf_szip5 on __splink__df_concat."szip5" = __splink__df_tf_szip5."szip5" left join __splink__df_tf_scity on __splink__df_concat."scity" = __splink__df_tf_scity."scity"
-------End SQL-----------
Setting cache for __splink__df_concat_with_tf_41f206f06 with physical name __splink__df_concat_with_tf_41f206f06
Setting cache for __splink__df_concat_with_tf with physical name __splink__df_concat_with_tf_41f206f06
SQL pipeline was passed inputs [__splink__df_concat_with_tf_41f206f06] and output dataset __splink__blocked_id_pairs
Pipeline part 1: CTE reads tables [ __splink__df_concat_with_tf_41f206f06 ] and has output table name: __splink__df_concat_with_tf
Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf AS r, __splink__df_concat_with_tf AS l ] and has output table name: __splink__blocked_id_pairs
Executing sql to create __splink__blocked_id_pairs as physical name __splink__blocked_id_pairs_0d1339724
------Start SQL---------
CREATE TABLE __splink__blocked_id_pairs_0d1339724 AS
WITH
__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_41f206f06)
select
'0' as match_key,
l."addid" as join_key_l,
r."addid" as join_key_r
from __splink__df_concat_with_tf as l
inner join __splink__df_concat_with_tf as r
on
(l."addid" = r."addid")
where 1=1
-------End SQL-----------
Setting cache for __splink__blocked_id_pairs_0d1339724 with physical name __splink__blocked_id_pairs_0d1339724
SQL pipeline was passed inputs [__splink__blocked_id_pairs_0d1339724, __splink__df_concat_with_tf_41f206f06] and output dataset __splink__df_self_link
Pipeline part 1: CTE reads tables [ __splink__blocked_id_pairs_0d1339724 ] and has output table name: __splink__blocked_id_pairs
Pipeline part 2: CTE reads tables [ __splink__df_concat_with_tf_41f206f06 ] and has output table name: __splink__df_concat_with_tf
Pipeline part 3: CTE reads tables [ __splink__df_concat_with_tf AS r, __splink__blocked_id_pairs AS b, __splink__df_concat_with_tf AS l ] and has output table name: blocked_with_cols
Pipeline part 4: CTE reads tables [ blocked_with_cols ] and has output table name: __splink__df_comparison_vectors
Pipeline part 5: CTE reads tables [ __splink__df_comparison_vectors ] and has output table name: __splink__df_match_weight_parts
Pipeline part 6: CTE reads tables [ __splink__df_match_weight_parts ] and has output table name: __splink__df_self_link
Executing sql to create __splink__df_self_link as physical name __splink__df_self_link_255e31a79
------Start SQL---------
CREATE TABLE __splink__df_self_link_255e31a79 AS
WITH
__splink__blocked_id_pairs as (
select * from __splink__blocked_id_pairs_0d1339724),
__splink__df_concat_with_tf as (
select * from __splink__df_concat_with_tf_41f206f06),
blocked_with_cols as (
select "l"."addid" AS "addid_l",
"r"."addid" AS "addid_r",
"l"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_l",
"r"."geohash_self_and_neighbors7" AS "geohash_self_and_neighbors7_r",
"l"."accuracy" AS "accuracy_l",
"r"."accuracy" AS "accuracy_r",
"l"."saddstr" AS "saddstr_l",
"r"."saddstr" AS "saddstr_r",
"l"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_l",
"r"."geohash_self_and_neighbors8" AS "geohash_self_and_neighbors8_r",
"l"."scity" AS "scity_l",
"r"."scity" AS "scity_r",
"l"."tf_scity" AS "tf_scity_l",
"r"."tf_scity" AS "tf_scity_r",
"l"."szip5" AS "szip5_l",
"r"."szip5" AS "szip5_r",
"l"."tf_szip5" AS "tf_szip5_l",
"r"."tf_szip5" AS "tf_szip5_r", b.match_key
from __splink__df_concat_with_tf as l
inner join __splink__blocked_id_pairs as b
on l."addid" = b.join_key_l
inner join __splink__df_concat_with_tf as r
on r."addid" = b.join_key_r
),
__splink__df_comparison_vectors as (
select "addid_l",
"addid_r",
CASE WHEN (saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL) THEN -1 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 10 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 9 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 8 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL)) THEN 7 WHEN (saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 6 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 5 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 4 WHEN (jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) THEN 3 WHEN array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 2 WHEN 
array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL) THEN 1 ELSE 0 END as gamma_street,
CASE WHEN "scity_l" IS NULL OR "scity_r" IS NULL THEN -1 WHEN "scity_l" = "scity_r" THEN 1 ELSE 0 END as gamma_scity,
"tf_scity_l",
"tf_scity_r",
CASE WHEN szip5_l IS NULL OR szip5_r IS NULL THEN -1 WHEN szip5_l = szip5_r THEN 4 WHEN substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4) THEN 3 WHEN substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3) THEN 2 WHEN damerau_levenshtein(szip5_l, szip5_r) <= 2 THEN 1 ELSE 0 END as gamma_szip5,
"tf_szip5_l",
"tf_szip5_r",
match_key
from blocked_with_cols
),
__splink__df_match_weight_parts as (
select "addid_l","addid_r",gamma_street,CASE
WHEN
gamma_street = -1
THEN cast(1.0 as float8)
WHEN
gamma_street = 10
THEN cast(49678.52911992316 as float8)
WHEN
gamma_street = 9
THEN cast(21536.078003967268 as float8)
WHEN
gamma_street = 8
THEN cast(31479.454181137276 as float8)
WHEN
gamma_street = 7
THEN cast(42510.434297932894 as float8)
WHEN
gamma_street = 6
THEN cast(2223.603135817533 as float8)
WHEN
gamma_street = 5
THEN cast(230.75868436173087 as float8)
WHEN
gamma_street = 4
THEN cast(113.05510082601522 as float8)
WHEN
gamma_street = 3
THEN cast(85.75861558013442 as float8)
WHEN
gamma_street = 2
THEN cast(14.205441987736027 as float8)
WHEN
gamma_street = 1
THEN cast(65.00410034352372 as float8)
WHEN
gamma_street = 0
THEN cast(0.0006700287948115862 as float8)
END as bf_street ,gamma_scity,CASE
WHEN
gamma_scity = -1
THEN cast(1.0 as float8)
WHEN
gamma_scity = 1
THEN cast(20.338205372111965 as float8)
WHEN
gamma_scity = 0
THEN cast(0.04420372330105158 as float8)
END as bf_scity ,CASE WHEN gamma_scity = -1 then cast(1 as float8) WHEN gamma_scity = 1 then
(CASE WHEN coalesce("tf_scity_l", "tf_scity_r") is not null
THEN
POW(
cast(0.04709747703972181 as float8) /
(CASE
WHEN coalesce("tf_scity_l", "tf_scity_r") >= coalesce("tf_scity_r", "tf_scity_l")
THEN coalesce("tf_scity_l", "tf_scity_r")
ELSE coalesce("tf_scity_r", "tf_scity_l")
END)
,
cast(1.0 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_scity = 0 then cast(1 as float8) END as bf_tf_adj_scity ,gamma_szip5,CASE
WHEN
gamma_szip5 = -1
THEN cast(1.0 as float8)
WHEN
gamma_szip5 = 4
THEN cast(245.60618690986593 as float8)
WHEN
gamma_szip5 = 3
THEN cast(0.3741913067691637 as float8)
WHEN
gamma_szip5 = 2
THEN cast(0.23762349361145213 as float8)
WHEN
gamma_szip5 = 1
THEN cast(2.973602866192421e-13 as float8)
WHEN
gamma_szip5 = 0
THEN cast(6.578648872187775e-15 as float8)
END as bf_szip5 ,CASE WHEN gamma_szip5 = -1 then cast(1 as float8) WHEN gamma_szip5 = 4 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(1.0 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 3 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(0.5 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 2 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(0.5 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 1 then
(CASE WHEN coalesce("tf_szip5_l", "tf_szip5_r") is not null
THEN
POW(
cast(0.003993214383316761 as float8) /
(CASE
WHEN coalesce("tf_szip5_l", "tf_szip5_r") >= coalesce("tf_szip5_r", "tf_szip5_l")
AND coalesce("tf_szip5_l", "tf_szip5_r") > cast(0.001 as float8)
THEN coalesce("tf_szip5_l", "tf_szip5_r")
WHEN coalesce("tf_szip5_r", "tf_szip5_l") > cast(0.001 as float8)
THEN coalesce("tf_szip5_r", "tf_szip5_l")
ELSE cast(0.001 as float8)
END)
,
cast(0.5 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_szip5 = 0 then cast(1 as float8) END as bf_tf_adj_szip5
from __splink__df_comparison_vectors
)
select
log2(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5) as match_weight,
CASE WHEN bf_street = cast('infinity' as float8) OR bf_scity = cast('infinity' as float8) OR bf_tf_adj_scity = cast('infinity' as float8) OR bf_szip5 = cast('infinity' as float8) OR bf_tf_adj_szip5 = cast('infinity' as float8) THEN 1.0 ELSE (cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)/(1+(cast(5.58086036478609e-07 as float8) * bf_street * bf_scity * bf_tf_adj_scity * bf_szip5 * bf_tf_adj_szip5)) END as match_probability,
"addid_l","addid_r"
from __splink__df_match_weight_parts
-------End SQL-----------
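If it helps, the arithmetic in that final select reduces to the following pure-Python equivalent (illustration only, not Splink code — the function name is mine):

```python
import math

def match_weight_and_probability(prior, bayes_factors):
    """Mirror the final SELECT: multiply the prior by each comparison's
    Bayes factor, then convert to a match weight (log2 of the odds)
    and a match probability (odds / (1 + odds))."""
    odds = prior
    for bf in bayes_factors:
        odds *= bf
    # The SQL special-cases infinite Bayes factors to probability 1.0
    if math.isinf(odds):
        return float("inf"), 1.0
    return math.log2(odds), odds / (1 + odds)
```

So an infinite Bayes factor pins the probability at 1.0, exactly as the CASE expression in the SQL does.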
Thanks again for the help! P.S. Unfortunately I will miss your presentation at Princeton this week, but I am encouraging all of my colleagues to attend.
-
There's nothing that jumps out at me from the SQL. I have two hypotheses:
-
I'm thinking the issue might be either a different problem with duckdb or an issue with the computation. I updated to duckdb 1.1.2 without any new luck, and I'm not seeing any progress with using
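For reference, this is how I set logging up to try to see the SQL Splink runs. The very low level (5) is an assumption based on Splink's debugging docs, so it may need lowering further if nothing shows up:

```python
import logging

# Send log records to the console with a bare format
logging.basicConfig(format="%(message)s")

# Splink emits the SQL it executes at custom, very low log levels;
# level 5 is assumed here from the debugging docs
logging.getLogger("splink").setLevel(5)
```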
-
I've had another look. The following script creates 1 million fake records according to your schema and then produces the unlinkables chart:

import pandas as pd
import time
from splink import DuckDBAPI, Linker
db_api = DuckDBAPI()
data = [
{
"saddno": "123",
"saddstr": "Main St",
"scity": "New York",
"szip5": "10001",
"geohash_self_and_neighbors7": ["dr5r7p6", "dr5r7p6"],
"accuracy": 0.98,
"geohash_self_and_neighbors8": ["dr5r7p63", "dr5r7p63"],
},
{
"saddno": "456",
"saddstr": "Broadway",
"scity": "Los Angeles",
"szip5": "90001",
"geohash_self_and_neighbors7": ["9q5cgvc"],
"accuracy": 0.95,
"geohash_self_and_neighbors8": ["9q5cgvcc", "9q5cgvcc"],
},
{
"saddno": "789",
"saddstr": "Elm St",
"scity": "Chicago",
"szip5": "60601",
"geohash_self_and_neighbors7": ["dp3wj1j", "dp3wj1j"],
"accuracy": 0.92,
"geohash_self_and_neighbors8": ["dp3wj1j8", "dp3wj1j8"],
},
]
df = pd.DataFrame(data)
# Create 1 million rows by repeating the existing rows
df = pd.concat([df] * (10**6 // len(df)), ignore_index=True)
# Reset the index and use it as the 'addid' column
df = df.reset_index(drop=True)
df["addid"] = df.index
df = df.copy()
settings_dict = {
"link_type": "dedupe_only",
"probability_two_random_records_match": 5.580857250187588e-07,
"retain_matching_columns": False,
"retain_intermediate_calculation_columns": False,
"additional_columns_to_retain": ["addid"],
"sql_dialect": "duckdb",
"linker_uid": "zw7adf0n",
"em_convergence": 0.0001,
"max_iterations": 25,
"bayes_factor_column_prefix": "bf_",
"term_frequency_adjustment_column_prefix": "tf_",
"comparison_vector_value_column_prefix": "gamma_",
"unique_id_column_name": "addid",
"source_dataset_column_name": "source_dataset",
"blocking_rules_to_generate_predictions": [
{
"blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity AND l.szip5 = r.szip5",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.scity = r.scity",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND l.szip5 = r.szip5",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr AND substr(l.szip5,1,4) = substr(r.szip5,1,4)",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity AND l.szip5 = r.szip5",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.scity = r.scity",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND substr(l.saddstr,1,1) = substr(r.saddstr,1,1) AND l.szip5 = r.szip5",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND l.saddstr = r.saddstr",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND l.scity = r.scity",
"sql_dialect": "duckdb",
},
{
"blocking_rule": "l.saddno = r.saddno AND l.szip5 = r.szip5",
"sql_dialect": "duckdb",
},
],
"comparisons": [
{
"output_column_name": "street",
"comparison_levels": [
{
"sql_condition": "(saddstr_l IS NULL OR saddstr_r IS NULL) AND (geohash_self_and_neighbors7_l IS NULL OR geohash_self_and_neighbors7_r IS NULL OR accuracy_l IS NULL OR accuracy_r IS NULL)",
"label_for_charts": "Null",
"is_null_level": True,
},
{
"sql_condition": "(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
"label_for_charts": "Exact match (saddstr) and Intersect (geohash 7)",
"m_probability": 0.5184678962113823,
"u_probability": 1.0436458272744132e-05,
},
{
"sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
"label_for_charts": "Jaro-Winkler (saddstr) >= 0.92 and Intersect (geohash 7)",
"m_probability": 0.011895738173284133,
"u_probability": 5.523632562573723e-07,
},
{
"sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
"label_for_charts": "Jaro-Winkler (saddstr) >= 0.88 and Intersect (geohash 7)",
"m_probability": 0.019878517576250577,
"u_probability": 6.314759290890734e-07,
},
{
"sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL)) AND (array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL))",
"label_for_charts": "Jaro-Winkler (saddstr) >= 0.84 and Intersect (geohash 7)",
"m_probability": 0.018633469905727222,
"u_probability": 4.3832697109455996e-07,
},
{
"sql_condition": "(saddstr_l = saddstr_r AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
"label_for_charts": "Exact match (saddstr)",
"m_probability": 0.3320169362895465,
"u_probability": 0.0001493148354314929,
},
{
"sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.92 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
"label_for_charts": "Jaro-Winkler (saddstr) >= 0.92",
"m_probability": 0.019878741014857884,
"u_probability": 8.614514799233525e-05,
},
{
"sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.88 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
"label_for_charts": "Jaro-Winkler (saddstr) >= 0.88",
"m_probability": 0.03469364631438866,
"u_probability": 0.0003068737815534748,
},
{
"sql_condition": "(jaro_winkler_similarity(saddstr_l, saddstr_r) >= 0.84 AND (saddstr_l IS NOT NULL AND saddstr_r IS NOT NULL))",
"label_for_charts": "Jaro-Winkler (saddstr) >= 0.84",
"m_probability": 0.04372303645073314,
"u_probability": 0.0005098384127933774,
},
{
"sql_condition": "array_length(list_intersect(geohash_self_and_neighbors8_l, geohash_self_and_neighbors8_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)",
"label_for_charts": "Intersect (geohash 8)",
"m_probability": 1.7211818229633315e-06,
"u_probability": 1.211635529854881e-07,
},
{
"sql_condition": "array_length(list_intersect(geohash_self_and_neighbors7_l, geohash_self_and_neighbors7_r)) > 0 AND (accuracy_l IS NOT NULL AND accuracy_r IS NOT NULL) AND (saddstr_l IS NULL OR saddstr_r IS NULL)",
"label_for_charts": "Intersect (geohash 7)",
"m_probability": 0.00014098268683681274,
"u_probability": 2.168827598440237e-06,
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons",
"m_probability": 0.0006693141951697755,
"u_probability": 0.9989334792066487,
},
],
"comparison_description": "CustomComparison",
},
{
"output_column_name": "scity",
"comparison_levels": [
{
"sql_condition": '"scity_l" IS NULL OR "scity_r" IS NULL',
"label_for_charts": "scity is NULL",
"is_null_level": True,
},
{
"sql_condition": '"scity_l" = "scity_r"',
"label_for_charts": "Exact match on scity",
"m_probability": 0.95787816054219,
"u_probability": 0.04709747703972181,
"tf_adjustment_column": "scity",
"tf_adjustment_weight": 1.0,
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons",
"m_probability": 0.04212183945781009,
"u_probability": 0.9529025229602782,
},
],
"comparison_description": "ExactMatch",
},
{
"output_column_name": "szip5",
"comparison_levels": [
{
"sql_condition": "szip5_l IS NULL OR szip5_r IS NULL",
"label_for_charts": "Null",
"is_null_level": True,
},
{
"sql_condition": "szip5_l = szip5_r",
"label_for_charts": "Exact match",
"m_probability": 0.9807581582000615,
"u_probability": 0.003993214383316761,
"tf_adjustment_column": "szip5",
"tf_minimum_u_value": 0.001,
"tf_adjustment_weight": 1.0,
},
{
"sql_condition": "substr(szip5_l, 1, 4) = substr(szip5_r, 1, 4)",
"label_for_charts": "Exact match (zip4)",
"m_probability": 0.005447918971943777,
"u_probability": 0.01455918102155314,
"tf_adjustment_column": "szip5",
"tf_minimum_u_value": 0.001,
"tf_adjustment_weight": 0.5,
},
{
"sql_condition": "substr(szip5_l, 1, 3) = substr(szip5_r, 1, 3)",
"label_for_charts": "Exact match (zip3)",
"m_probability": 0.013793922827954863,
"u_probability": 0.058049490891291536,
"tf_adjustment_column": "szip5",
"tf_minimum_u_value": 0.001,
"tf_adjustment_weight": 0.5,
},
{
"sql_condition": "damerau_levenshtein(szip5_l, szip5_r) <= 2",
"label_for_charts": "Damerau-Levenshtein <= 2",
"m_probability": 3.460899719747275e-14,
"u_probability": 0.11638742210989385,
"tf_adjustment_column": "szip5",
"tf_minimum_u_value": 0.001,
"tf_adjustment_weight": 0.5,
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons",
"m_probability": 5.30903997609798e-15,
"u_probability": 0.8070106915939447,
},
],
"comparison_description": "CustomComparison",
},
],
}
linker = Linker(df, settings_dict, db_api)
start_time = time.time()
linker.evaluation.unlinkables_chart()
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
It executes successfully on my laptop, with the Splink part taking 6 seconds In terms of data size - that shouldn't be causing an issue here. The unlinkables chart produces a self link, so with a million reocrds it's only having to evaluate a million comparisons. All I can't think of is there's something in your dataset that's causing a problem with duckdb |
-
Thanks for taking a deeper look at this. I'll have to come back to it later; I've been sidetracked with other things. I'm a bit at a loss as to how to reproduce the error (if there is one), but I will follow up if I can figure out the cause.
-
Hi!
I'm trying to produce some evaluation charts (like the unlinkables chart) for a Splink model I ran and saved, but the code runs for ages and nothing comes out. I would usually include a reproducible example, but I'm not getting any errors, and the same code in fact works (very quickly) for other models I have run; it's just this model in particular that can't produce any evaluation charts.
There doesn't seem to be any substantial difference between the saved Splink models for which I can produce the evaluation charts and for which I cannot.
The code looks like this. It's meant to save all the evaluation charts as JSONs for several saved Splink models that all have separate associated DuckDB files containing the records that were linked. (Across the DuckDB connections, the tables all happen to have the same names, because I partitioned a large national dataset of US addresses by state before running the same Splink model training code for each partition.)
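Roughly, it looks like the following; I've simplified the paths and listed only one chart, so treat the specific names (table name, chart method, file layout) as illustrative rather than my exact script:

```python
import json
from pathlib import Path

def chart_paths(out_dir, state, chart_names):
    """One JSON file per chart, grouped by state partition."""
    return {name: Path(out_dir) / state / f"{name}.json" for name in chart_names}

def save_charts_for_state(state, db_path, model_path, out_dir):
    # splink/duckdb are imported lazily so the path helper above
    # stays usable without them installed
    import duckdb
    from splink import DuckDBAPI, Linker

    con = duckdb.connect(db_path)
    settings = json.loads(Path(model_path).read_text())
    # "addresses" stands in for the table name registered in each DuckDB file
    linker = Linker("addresses", settings, DuckDBAPI(connection=con))
    for name, path in chart_paths(out_dir, state, ["unlinkables_chart"]).items():
        chart = getattr(linker.evaluation, name)()  # an Altair chart object
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(chart.to_json())
```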
I am running this code on a machine with 8 cores and 90 GB of RAM. Some of the charts are created and saved in less than a few seconds, whereas the ones I'm having trouble with take an indefinite amount of time; I have had this running for hours. To be clear, I know that no errors are being suppressed by the for-loop, because I ran the code for the problematic Splink models outside of the loop.
My first question is whether there is anything obviously wrong with this code?
My second question is whether I should share the JSON for the saved model that is giving me trouble?
Thank you!