Replies: 3 comments 6 replies
-
When saving and reloading a model, the predictions you get should be the same. The only caveat is that term frequencies are, by default, computed from the data you pass into Splink, so if you pass in only 500 values, they might be badly wrong. If this is the cause of the problem, you'd want to manually load in representative term frequency tables. You can compute term frequency tables using a linker that has the full dataset passed in, and then load them into the linker you use for prediction, e.g. as in the sketch below.
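A minimal sketch of that workflow, assuming Splink 3's Linker API: `full_data_linker`, `small_batch_linker` and the `first_name` column are illustrative names, and the `__splink__df_tf_<column>` table name follows the convention Splink uses for its term frequency tables, so check it against your version's docs.

```python
import pandas as pd

# Linker built over the FULL dataset: compute a term frequency table for each
# column that uses term frequency adjustments, and persist it somewhere cheap.
tf_first_name = full_data_linker.compute_tf_table("first_name")
tf_first_name.as_pandas_dataframe().to_parquet("tf_first_name.parquet")

# Linker built over the small incremental batch: register the precomputed
# table so Splink uses it instead of recomputing term frequencies from the
# tiny input sample.
tf_lookup = pd.read_parquet("tf_first_name.parquet")
small_batch_linker.register_table(tf_lookup, "__splink__df_tf_first_name")
```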
-
Hmm, okay, that is a little difficult for my process; maybe I will turn off term frequencies. Although it's still not clear to me why two exactly matching records would not be returned regardless of whether the term frequencies are accurate or inaccurate. I intended to create a model on a full, static dataset. Then, with the saved model, I would use predicate pushdown against my "golden" record set for incremental data coming in. The golden dataset is 20-100 million records, but the incremental data may only be 50-100K a day. To tighten the feedback loop on incremental data predictions, which are evented rather than batched, I was going to use the model's blocking rules to pre-filter the golden dataset for DuckDBLinker(), allowing me to use a small machine to process incremental data, while also letting me move the model into batch processing on days we get a million records. Or I may just experiment with probability_two_random_records_match and find a proportion I like, because that does work somewhat, although I worry about consistency. If I save the term frequency tables, of which I have 5, it will make my transactional process too large to be efficient at transactional processing.
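As an illustration of the pre-filtering idea described above (this is plumbing around Splink, not something Splink does for you), a rough sketch assuming a hypothetical postcode blocking column and DuckDB tables named `golden` and `incremental`:

```python
import duckdb

con = duckdb.connect("linkage.duckdb")

# Keep only golden records that share a blocking key (here, postcode) with at
# least one record in today's incremental batch, so the linker only has to
# compare against a small candidate set.
con.execute("""
    CREATE OR REPLACE TABLE golden_candidates AS
    SELECT g.*
    FROM golden AS g
    WHERE g.postcode IN (SELECT DISTINCT postcode FROM incremental)
""")

# golden_candidates and incremental can then be passed to DuckDBLinker as
# input_table_or_tables for a link-only prediction run on a small machine.
```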
-
For future readers, if you reduce the input_table_or_tables in

```python
linker = DuckDBLinker(
    input_table_or_tables=input_tables,
    settings_dict=in_transaction_model,
    connection=con,
    set_up_basic_logging=True,
    input_table_aliases=None,
    validate_settings=True,
)
```

you have to change the …
-
I have a training dataset of 10 million records, which produces a model with "probability_two_random_records_match": 1.12398e-7. When I run the saved model in prediction-only mode against a subset of 500 records, the model finds no predictions, even though two of the records match 100%. If I manually change the saved model to "probability_two_random_records_match": 0.1 and run against the 500 records, I do get matches, but my match weights and match probabilities also change drastically.
To me this is unexpected behaviour and not ideal, mainly because I have two records that should have matched deterministically, yet they produced no results when probability_two_random_records_match is trained to be low and the dataset is smaller than the training data. I expected the two exact records to match, but they did not. What is going on here?
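For readers wondering how an exact match can still get a near-zero match probability: in the Fellegi-Sunter model Splink uses, probability_two_random_records_match sets the prior odds, and the comparison-level Bayes factors have to overcome that prior. A rough worked example with made-up Bayes factors (not taken from the model above):

```python
import math

# Prior odds implied by probability_two_random_records_match
p = 1.12398e-7
prior_odds = p / (1 - p)
prior_weight = math.log2(prior_odds)          # about -23.1 "bits" against a match

# Illustrative Bayes factors for the exact-match level of five comparisons.
# If term frequencies are recomputed from a 500-record sample, these factors
# can come out much smaller than on the full dataset.
bayes_factors = [50, 40, 30, 20, 10]
evidence_weight = sum(math.log2(bf) for bf in bayes_factors)   # about +23.5

posterior_odds = prior_odds * math.prod(bayes_factors)
match_probability = posterior_odds / (1 + posterior_odds)
print(prior_weight, evidence_weight, match_probability)        # ~ -23.1, +23.5, 0.57
```

With a prior this small, even strong agreement on five fields barely pushes the probability above 50%, which is why term frequencies recomputed from a tiny sample, or a manually edited prior, change the results so drastically.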