CausalQueriesTools is currently under development and very much experimental. It is intended to be a companion package to CausalQueries, implementing optimized core CausalQueries methods to make, update and query very large causal models.
- Installation
- Usage
- Methods & Benchmarks
  - Motivation
  - Creating large causal models
  - Updating large causal models
  - Querying large causal models
- Development
You can install the development version of CausalQueriesTools from GitHub with:
# install.packages("devtools")
devtools::install_github("till-tietz/CausalQueriesTools")
library(CausalQueries)
library(CausalQueriesTools)

# create a simple model assuming monotonicity and no interactions,
# then add select interactions back into the model:
# two-way interactions of M1 & M2 on Y and of M3 & M4 on Y
model <- make_simple_model("X -> M1; X -> M2; X -> M3; X -> M4;
                            M1 -> Y; M2 -> Y; M3 -> Y; M4 -> Y") |>
  interact_model(interactions = list(c("M1", "M2", "Y"), c("M3", "M4", "Y")))

# generate data from the causal model
data <- CausalQueries::make_data(model, n = 1000)

# update the model utilizing graph splitting
# parallel = TRUE enables nested parallelism: sub-models are updated in
# parallel and Markov chains within each sub-model also run in parallel
model <- update_stitch(model, data, chains = 2, parallel = TRUE)
Complex theories beget complex models, posing substantial computational challenges to the practical application of the CausalQueries framework to theories of the complexity often favored by practitioners. Even relatively simple-seeming DAGs can test the limits of the memory and processing power available on modern hardware once encoded as causal models with standard CausalQueries. The reason is that default models place no restrictions on the types of causal relations allowed, conditional on the DAG, and the number of possible causal relations grows very rapidly with the number of causes of a given variable because of the many ways those causes can interact.
Consider, by way of illustration, the model described by the following DAG. It alone has ~4 billion nodal-types and ~8,796 billion "causal-types" (possible responses), as well as a myriad of further parameters attached to both.
A binary node's set of nodal-types grows super-exponentially in its number of parents: a node with k binary parents has 2^(2^k) nodal-types, one for each mapping from the 2^k parent value combinations to an outcome in {0, 1}.
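To get a sense of the pace of this growth, the counts can be computed directly in R (illustrative only):

# number of nodal-types for a binary node with k binary parents: 2^(2^k)
k <- 0:5
2^(2^k)  # 2, 4, 16, 256, 65536, 4294967296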
These numbers grow so rapidly that theories quickly become impossible to represent and operate on as causal models. Simply storing all nodal-types on standard hardware quickly becomes infeasible (see the memory estimates in the querying section below).
In response, users need tools to simplify models, tools to work with large models, or both.
Consider first model simplification. One approach to simplification is to restrict the nodal-type space via monotonicity and no-interaction assumptions. CausalQueriesTools provides the make_simple_model() function to this end. This effectively reverses the standard CausalQueries workflow of make_model() followed by restrict_model(), by directly creating nodal-types consistent with the above assumptions and attaching them to the model. CausalQueriesTools further allows adding interactions of arbitrary order on any node back into monotonic, non-interacted models via the interact_model() function. This method exploits the fact that interaction nodal-types can be generated from monotonic, non-interacted nodal-types by applying logical AND and OR operations to their component digits.
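As a minimal illustration of this digit logic (toy vectors, not package internals): a nodal-type for a binary Y with parents M1 and M2 is a response vector giving Y's value under each parent value combination, and interaction types follow from elementwise logical operations on the monotonic single-cause types.

# response vectors over parent combinations (M1, M2) = 00, 10, 01, 11
y_if_m1 <- c(0, 1, 0, 1)   # monotonic, non-interacted type: Y = M1
y_if_m2 <- c(0, 0, 1, 1)   # monotonic, non-interacted type: Y = M2

# interaction types via logical operations on the component digits
y_and <- as.integer(y_if_m1 & y_if_m2)   # 0 0 0 1 : Y = M1 AND M2
y_or  <- as.integer(y_if_m1 | y_if_m2)   # 0 1 1 1 : Y = M1 OR M2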
Consider now the problem of updating large models. CausalQueriesTools utilizes graph splitting and nested parallel processing to make updating large causal models more efficient. The update_stitch() function leverages the local Markov property, under which the joint distribution over a causally ordered DAG factorizes into the distribution of each node conditional on its parents, to factor DAGs into more efficiently updatable sub-DAGs; a chain X -> M -> Y, for example, factors into the sub-models X -> M and M -> Y.
Instead of updating a full model using standard CausalQueries update_model(), update_stitch() can thus update its much simpler component sub-models and 'stitch' the posteriors back together. Using this process, model updating run-time no longer grows exponentially with model size, but rather linearly with respect to the average complexity of component sub-models. We further optimize run-time by implementing nested futures evaluation for parallel processing. Nested parallelism allows update_stitch() to update sub-models in parallel while also running Markov chains within each sub-model in parallel. Given a sufficient number of cores, the optimal run-time of update_stitch() is therefore the updating time of the most complex sub-model plus the overhead created by splitting, stitching and parallel process set-up.
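Nested parallelism of this kind is typically declared via the future package's nested topologies. A minimal sketch of such a two-layer plan (the exact plan update_stitch() configures internally may differ):

library(future)

# outer layer: update sub-models in parallel
# inner layer: run Markov chains within each sub-model in parallel
plan(list(
  tweak(multisession, workers = 2),   # workers for sub-models
  tweak(multisession, workers = 4)    # workers for chains per sub-model
))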
A performance comparison updating the above model on data with 1000 observations yields the following results:
method | min | mean | median | max |
---|---|---|---|---|
update_model | 35.83 | 38.69 | 38.88 | 40.33 |
update_stitch | 8.99 | 10.07 | 10.00 | 11.08 |
To show that both methods yield the same result, we query the updated models for the average treatment effect (ATE) of X on Y:
method | mean | sd | conf.low | conf.high |
---|---|---|---|---|
update_model | 2.04e-05 | 0.0000924 | -0.0001645 | 0.0002509 |
update_stitch | 2.09e-05 | 0.0001026 | -0.0001702 | 0.0002623 |
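A query of this form can be posed with CausalQueries' query_model(); a sketch, assuming the X -> ... -> Y model from the usage example and argument names as in current CausalQueries releases:

CausalQueries::query_model(
  model,
  queries = "Y[X=1] - Y[X=0]",   # average treatment effect of X on Y
  using = "posteriors"
)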
To illustrate the performance gains derived from model splitting, we benchmark update_model() and update_stitch() on updating models of increasing size with data consisting of 1000 observations. Models are simple causal chains with 2 to 7 nodes. We update each model 10 times using each function and present average run-times. We run 4 Markov chains in parallel for both update_model() and update_stitch(), with update_stitch() further parallelising across sub-models.
Space and time complexity efficient querying of causal models with billions of causal-types is currently under development.
Querying a causal model, at its most fundamental, involves taking the matrix product of a 1 x m matrix of causal types and an m x n matrix of causal type distributions, where m is the number of causal types implicated by a given query and n is the number of draws from the causal type probability distribution. This matrix product is normalized by the vector of column sums of the m x n causal type probability distribution matrix.
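A minimal sketch of this computation with toy dimensions and random values (illustrative only, not package code):

set.seed(1)
m <- 8     # number of causal types
n <- 4     # number of posterior draws

# 1 x m indicator of which causal types satisfy the query
query_types <- matrix(as.numeric(seq_len(m) %in% c(2, 5, 7)), nrow = 1)

# m x n matrix of causal type probabilities, one column per draw
type_probs <- matrix(runif(m * n), nrow = m)

# per-draw query values: mass on query-satisfying types,
# normalized by each column's total mass
estimand <- as.numeric(query_types %*% type_probs) / colSums(type_probs)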
The above process requires 5 key data structures:
- the matrix of causal types implicated by a query
- the matrix of causal type distributions
- a matrix mapping parameters to causal types
- a matrix of parameter distributions
- a set of matrices of realised outcomes for each node and causal type
More specifically, the construction of the 1 x m matrix of causal types implicated by a query requires a set of matrices of realised outcomes given a set of do operations. The construction of the m x n matrix of causal type distributions requires a mapping of parameters to causal types and a matrix of parameter distributions. All data structures but the parameter distributions grow with the number of causal types in a model. Assuming a model with 1x10^9 causal types, constructing an integer vector of causal types would consume 4 gigabytes of memory (4 bytes per signed integer x 1x10^9). Constructing a double matrix of causal type distributions with the standard 4000 draws consumes 32,000 gigabytes (32 terabytes) of memory (8 bytes per double x 4000 x 1x10^9). These estimates represent lower bounds on memory usage, ignoring pointers, metadata, attributes and memory consumed by operations on these data structures.
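These back-of-the-envelope figures can be reproduced directly in R (sketch):

n_types <- 1e9    # causal types in the model
n_draws <- 4000   # posterior draws

4 * n_types / 1e9             # integer vector of causal types: 4 GB
8 * n_draws * n_types / 1e12  # double distribution matrix: 32 TB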
Implemented:

- generating causal models with up to n > 4 parents per child by imposing monotonicity and no-interaction restrictions
- functionality to add interactions back into monotonicity and no-interaction models
- updating large models via graph splitting and stitching (currently only possible without confounding and with complete data)

In development:

- space & time complexity efficient querying
- updating via graph splitting and stitching with confounding and missing data
- helpers to generate optimal graph splitting strategies
- methods for validating DAGs using data