Replies: 3 comments 6 replies
-
When saving and reloading a model, the predictions you get should be the same. The only caveat is that term frequencies are, by default, computed from the data you pass into Splink, so if you pass in only 500 values, they might be badly wrong. If this is the cause of the problem, you'd want to manually load in representative term frequency tables. You can compute term frequency tables using a linker that has the full dataset passed in, and then load them into the linker you use for prediction, e.g. as in the sketch below.
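A minimal sketch of that workflow, assuming Splink 3's Linker API: `full_data_linker`, `small_batch_linker` and the `first_name` column are illustrative names, and the `__splink__df_tf_<column>` table name follows the convention Splink uses for its term frequency tables, so check it against your version's docs.

```python
import pandas as pd

# Linker built over the FULL dataset: compute a term frequency table for each
# column that uses term frequency adjustments, and persist it somewhere cheap.
tf_first_name = full_data_linker.compute_tf_table("first_name")
tf_first_name.as_pandas_dataframe().to_parquet("tf_first_name.parquet")

# Linker built over the small incremental batch: register the precomputed
# table so Splink uses it instead of recomputing term frequencies from the
# tiny input sample.
tf_lookup = pd.read_parquet("tf_first_name.parquet")
small_batch_linker.register_table(tf_lookup, "__splink__df_tf_first_name")
```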
-
Hmm, okay, that is a little difficult for my process; maybe I will turn off term frequencies. Although it's still not clear to me why two exactly matching records would not be returned regardless of whether the term frequencies are accurate or inaccurate. I intended to create a model on a full, static dataset. Then, with the saved model, I would use predicate pushdown against my "golden" record set for incremental data coming in. The golden dataset is 20-100 million records, but the incremental data may only be 50-100K a day. To tighten the feedback loop on incremental data predictions, which are evented rather than batched, I was going to use the model's blocking rules to pre-filter the golden dataset for DuckDBLinker(), allowing me to use a small machine to process incremental data, while also letting me move the model into batch processing on days we get a million records. Or I may just experiment with probability_two_random_records_match and find a proportion I like, because that does work somewhat, although I worry about consistency. If I save the term frequency tables, of which I have 5, it will make my transactional process too large to be efficient at transactional processing.
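As an illustration of the pre-filtering idea described above (this is plumbing around Splink, not something Splink does for you), a rough sketch assuming a hypothetical postcode blocking column and DuckDB tables named `golden` and `incremental`:

```python
import duckdb

con = duckdb.connect("linkage.duckdb")

# Keep only golden records that share a blocking key (here, postcode) with at
# least one record in today's incremental batch, so the linker only has to
# compare against a small candidate set.
con.execute("""
    CREATE OR REPLACE TABLE golden_candidates AS
    SELECT g.*
    FROM golden AS g
    WHERE g.postcode IN (SELECT DISTINCT postcode FROM incremental)
""")

# golden_candidates and incremental can then be passed to DuckDBLinker as
# input_table_or_tables for a link-only prediction run on a small machine.
```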
-
For future readers, if you reduce the input_table_or_tables in

```python
linker = DuckDBLinker(
    input_table_or_tables=input_tables,
    settings_dict=in_transaction_model,
    connection=con,
    set_up_basic_logging=True,
    input_table_aliases=None,
    validate_settings=True,
)
```

you have to change the …
-
I have a training dataset of 10 million records, which produces a model with "probability_two_random_records_match": 1.12398e-7. When I run the saved model in prediction-only mode against a subset of 500 records, the model finds no predictions, even though two of the records match 100%. If I manually change the saved model to "probability_two_random_records_match": 0.1 and run against the 500 records, I do get matches, but my match weights and match probabilities also change drastically.
To me this is unexpected behaviour and not ideal, mainly because I have two records that should have matched deterministically, yet they produced no results when probability_two_random_records_match is trained to be low and the dataset is smaller than the training data. I expected the two exact records to match, but they did not. What is going on here?
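For readers wondering how an exact match can still get a near-zero match probability: in the Fellegi-Sunter model Splink uses, probability_two_random_records_match sets the prior odds, and the comparison-level Bayes factors have to overcome that prior. A rough worked example with made-up Bayes factors (not taken from the model above):

```python
import math

# Prior odds implied by probability_two_random_records_match
p = 1.12398e-7
prior_odds = p / (1 - p)
prior_weight = math.log2(prior_odds)          # about -23.1 "bits" against a match

# Illustrative Bayes factors for the exact-match level of five comparisons.
# If term frequencies are recomputed from a 500-record sample, these factors
# can come out much smaller than on the full dataset.
bayes_factors = [50, 40, 30, 20, 10]
evidence_weight = sum(math.log2(bf) for bf in bayes_factors)   # about +23.5

posterior_odds = prior_odds * math.prod(bayes_factors)
match_probability = posterior_odds / (1 + posterior_odds)
print(prior_weight, evidence_weight, match_probability)        # ~ -23.1, +23.5, 0.57
```

With a prior this small, even strong agreement on five fields barely pushes the probability above 50%, which is why term frequencies recomputed from a tiny sample, or a manually edited prior, change the results so drastically.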