Add ppe method for predictive elicitation (experimental) #336
Early draft of a predictive elicitation method. I have only tested it on a couple of simple models; I already know it will fail for others. This is just a proof of concept.
The main idea is that the user provides a model (currently only PyMC; adding Bambi should be easy) and a "target distribution". This distribution is not any particular dataset but the "not yet observed data". The author of *Understanding Advanced Statistical Methods* calls this "DATA", as opposed to "data" (the dataset I want to "fit"). So if my model is about the height of adults in San Luis (from which I got a sample, i.e. my data), I can use my domain knowledge of adult humans (DATA) to elicit the target distribution.
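A hypothetical usage sketch: the call below assumes the method ends up exposed as `preliz.ppe(model, target)` with a PreliZ distribution as the target, which may not match the final API.

```python
import numpy as np
import pymc as pm
import preliz as pz

# Model for the height (in cm) of adults in San Luis. The observed values
# are a placeholder; the elicitation replaces them with target samples.
with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=10)
    pm.Normal("height", mu=mu, sigma=sigma, observed=np.zeros(100))

# Domain knowledge about adult humans ("DATA"): heights concentrate
# around ~170 cm with a spread of roughly 10 cm.
target = pz.Normal(170, 10)

# Hypothetical call; name and signature are still experimental.
new_priors = pz.ppe(model, target)
```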
A summary of the algorithm (a minimal sketch follows the list):

1. Generate a sample from the target distribution.
2. Maximize the model's likelihood for that sample (i.e. find the parameters for a fixed "observation").
3. Generate a new sample from the target and repeat.
4. Collect the optimized values in an array (one per prior parameter in the original model).
5. Use MLE to fit the optimized values to their corresponding families in the original model.
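A self-contained sketch of that loop, using SciPy instead of the actual machinery in this PR; `ppe_loop` and `nll` are illustrative stand-ins, not part of the proposed API.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(42)

def ppe_loop(draw_target, neg_log_likelihood, init_params, n_rounds=200):
    """Illustrative core loop: optimize parameters against fresh target samples."""
    optimized = []
    for _ in range(n_rounds):
        sample = draw_target(rng)                # steps 1/3: draw from the target
        res = minimize(neg_log_likelihood,       # step 2: fit the parameters
                       init_params,              # for this fixed "observation"
                       args=(sample,))
        optimized.append(res.x)                  # step 4: collect optimized values
    return np.asarray(optimized)                 # shape (n_rounds, n_params)

# Toy Normal likelihood with params = (mu, log_sigma), so the
# optimization is unconstrained.
def nll(params, sample):
    mu, log_sigma = params
    return -stats.norm.logpdf(sample, mu, np.exp(log_sigma)).sum()

draws = ppe_loop(lambda rng: rng.normal(170, 10, size=50), nll,
                 np.array([150.0, 1.0]))

# step 5: MLE-fit each column back to the prior's family in the original model
mu_loc, mu_scale = stats.norm.fit(draws[:, 0])
```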
This approach is similar to what we do in Kulprit. One difference is that for Kulprit the "target" is actually the posterior predictive distribution of a reference model, and we are interested in finding submodels (and their posteriors) that induce predictions as close as possible to the predictions from the reference model. Here we don't have a reference model; instead we have a human (or potentially a few humans). Another difference is that for Kulprit the optimized values are an approximation to the posterior we care about, while here we need to fit those values to the priors' families in the original model, because we cannot use samples as priors in a PyMC model (or in other PPLs).
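To make that last point concrete, a hedged sketch of feeding the fitted parameters back into the model (`draws` is assumed to come from a loop like the one above, and the Normal family is assumed to match the original prior on `mu`):

```python
import pymc as pm
from scipy import stats

# PyMC priors are parametric, so the optimized samples must be collapsed
# into the parameters of the original prior's family before reuse.
loc, scale = stats.norm.fit(draws[:, 0])

with pm.Model() as refined_model:
    mu = pm.Normal("mu", mu=loc, sigma=scale)  # elicited prior, same family
```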
A further difference is that here we use a slightly different approach to obtain the likelihood function for the optimization routine. If this can be generalized, we could use it in Kulprit too; I think this approach was not available when we discussed Kulprit's design, and it could potentially make the code easier to maintain and extend.