Revised documentation

- Slightly update and revise existing GCM notebooks
- Moving mediation analysis, direct arrow strength and ICC to their own "Quantify Causal Influence" section
- Adding brief overview to describe differences between the quantification methods
- Change navigation image to reflect newest changes
- Adding related notebooks links to some of the causal task entries
- Adding a direct arrow strength example to the ICC notebook
- Adding a brief overview of the available root cause analysis and explanation methods
- Smaller revision of other GCM entries, such as the basic example
- Smaller typo and missing reference fixes

Signed-off-by: Patrick Bloebaum <bloebp@amazon.com>
py-why · Dec 1, 2023 · 9ebf9e3
1 parent f455c56
commit 9ebf9e3
Showing 39 changed files with 494 additions and 261 deletions.
diff --git a/docs/source/example_notebooks/gcm_401k_analysis.ipynb b/docs/source/example_notebooks/gcm_401k_analysis.ipynb
@@ -6,7 +6,7 @@
    "source": [
     "# Impact of 401(k) eligibility on net financial assets\n",
     "\n",
-    "In this case study, we will use real-world data from 401(k) analysis to explain how Causality library can be used to estimate average treatment effect (ATE) and conditional ATE (CATE)."
+    "In this case study, we will use real-world data from 401(k) analysis to explain how graphical causal models can be used to estimate average treatment effect (ATE) and conditional ATE (CATE)."
    ]
   },
   {
@@ -106,6 +106,9 @@
     "edges.extend([(covariate, treatment_var) for covariate in covariates])\n",
     "edges.extend([(covariate, outcome_var) for covariate in covariates])\n",
     "\n",
+    "# To ensure that the treatment is considered as a categorical variable, we convert the values explicitly to strings.\n",
+    "df = df.astype({treatment_var: str})\n",
+    "\n",
     "causal_graph = nx.DiGraph(edges)"
    ]
   },
@@ -153,9 +156,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We observe that real-valued variables do not follow well-known parameteric distributions like Gaussian. Therefore, we fit empirical distributions whenever those variables do not have parents, which is also suitable for categorical variables. \n",
+    "We observe that real-valued variables do not follow well-known parametric distributions like Gaussian. Therefore, we fit empirical distributions whenever those variables do not have parents, which is also suitable for categorical variables. \n",
     "\n",
-    "Let's assign causal models to variables. For the treatment variable, we assign a classifier functional causal model (FCM) with a random forest classifier. For the outcome variable, we assign an additive noise model with a random forest regression as a function and empirical distribution for the noise. We assign empirical distributions to other variables as they do not have parents in the causal graph."
+    "Let’s assign the causal mechanisms to each node explicitly. For the treatment variable, we assign a classifier functional causal model (FCM) with a random forest classifier. For the outcome variable, we assign an additive noise model with random forest regression as a function and an empirical distribution for the noise. We assign empirical distributions to other variables as they do not have parents in the causal graph."
    ]
   },
   {
@@ -175,27 +178,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To fit a classifier FCM, we cast the treatment column to string type."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = df.astype({treatment_var: str})"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "Instead of assigning the models manually, we can also automate this **if** we don't have prior knowledge or are not familiar with the statistical implications:\n",
-    "    \n",
-    "> gcm.auto.assign_causal_mechanisms(causal_model, df)\n",
-    "</div>"
+    "> **If** we don't have prior knowledge or are not familiar with the statistical implications, we can also assign causal mechanisms automatically using a heuristic:\n",
+    "```\n",
+    "gcm.auto.assign_causal_mechanisms(causal_model, df)\n",
+    "```"
    ]
   },
   {
@@ -281,7 +267,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The average treatment effect of 401(k) eligibility on net financial assets is positive as indicated by the confidence interval $[4902.24, 8486.89]$. Now, let's plot CATEs of various income groups to get a clear picture."
+    "The average treatment effect of 401(k) eligibility on net financial assets is positive as indicated by the corresponding confidence interval. Now, let's plot CATEs of various income groups to get a clear picture."
    ]
   },
   {
@@ -327,7 +313,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.12"
+   "version": "3.9.16"
   }
  },
  "nbformat": 4,

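Note on the 401(k) notebook above: the ATE it discusses is the difference between the expected outcome when the treatment is forced to 1 and when it is forced to 0. The following is a minimal pure-Python sketch of that idea on a made-up toy model. The variable names, the coefficients, and the `simulate_outcome` helper are hypothetical and chosen for illustration only; the sketch does not use DoWhy's API or the 401(k) data.

```python
import random
from statistics import mean


def simulate_outcome(rng, forced_treatment=None):
    """Draw one sample from a toy SCM (hypothetical coefficients):
    income -> treatment, income -> assets, treatment -> assets."""
    income = rng.gauss(0.0, 1.0)
    if forced_treatment is None:
        # Observational regime: treatment depends on the confounder.
        treatment = 1 if income + rng.gauss(0.0, 1.0) > 0 else 0
    else:
        # Interventional regime: treatment is set from outside.
        treatment = forced_treatment
    assets = 2.0 * treatment + 3.0 * income + rng.gauss(0.0, 1.0)
    return treatment, assets


rng = random.Random(0)
n = 20000

# Naive comparison of treated vs. untreated units in observational data.
observed = [simulate_outcome(rng) for _ in range(n)]
naive_difference = (
    mean(a for t, a in observed if t == 1) - mean(a for t, a in observed if t == 0)
)

# ATE as a difference of interventional means.
ate = (
    mean(simulate_outcome(rng, 1)[1] for _ in range(n))
    - mean(simulate_outcome(rng, 0)[1] for _ in range(n))
)

print(f"naive difference: {naive_difference:.2f}")
print(f"interventional ATE: {ate:.2f}")
```

In this toy model the true effect is 2.0 by construction. The naive difference is inflated because income drives both treatment and assets, which is why the notebook models the covariates explicitly.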
diff --git a/docs/source/example_notebooks/gcm_basic_example.ipynb b/docs/source/example_notebooks/gcm_basic_example.ipynb
@@ -252,9 +252,7 @@
    "id": "67cfaf23-1b90-4124-84cb-39ea67401a22",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "Note, this evaluation take some significant time depending on the model complexities, graph size and amount of data. For a speed-up, consider changing the evaluation parameters.\n",
-    "</div>"
+    "> Note, this evaluation can take some significant time depending on the model complexities, graph size and amount of data. For a speed-up, consider changing the evaluation parameters."
    ]
   },
   {
@@ -264,9 +262,7 @@
    "source": [
     "## Step 3: Answering a causal query based on the SCM\n",
     "\n",
-    "The last step, answering a causal question, is our actual goal. E.g. we could ask the question:\n",
-    "\n",
-    "> What will happen to the variable Z if I intervene on Y?\n",
+    "The last step, answering a causal question, is our actual goal. E.g. we could ask the question \"What will happen to the variable Z if I intervene on Y?\".\n",
     "\n",
     "This can be done via the `interventional_samples` function. Here’s how:"
    ]
@@ -305,9 +301,7 @@
    "id": "d65e3319-58ea-4284-8810-c88b10d08448",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "DoWhy offers a wide range of causal questions that can be answered with GCMs. See the user guide or other notebooks for more examples.\n",
-    "</div>"
+    "> DoWhy offers a wide range of causal questions that can be answered with GCMs. See the user guide or other notebooks for more examples."
    ]
   }
  ],

diff --git a/docs/source/example_notebooks/gcm_draw_samples.ipynb b/docs/source/example_notebooks/gcm_draw_samples.ipynb
@@ -138,9 +138,7 @@
    "id": "f430827d-fc75-4c23-8d2d-3d98605dd927",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "While the evaluation provides us insights toward the causal graph structure as well, we cannot confirm the graph structure, only reject it if we find inconsistencies between the dependencies of the observed structure and what the graph represents. In our case, we do not reject the DAG, but there are other equivalent DAGs that would not be rejected as well. To see this, consider the example above - X→Y→Z and X←Y←Z would generate the same observational distribution (since they encode the same conditionals), but only X→Y→Z would generate the correct interventional distribution (e.g., when intervening on Y).\n",
-    "</div>"
+    "> While the evaluation provides us insights toward the causal graph structure as well, we cannot confirm the graph structure, only reject it if we find inconsistencies between the dependencies of the observed structure and what the graph represents. In our case, we do not reject the DAG, but there are other equivalent DAGs that would not be rejected as well. To see this, consider the example above - X→Y→Z and X←Y←Z would generate the same observational distribution (since they encode the same conditionals), but only X→Y→Z would generate the correct interventional distribution (e.g., when intervening on Y)."
    ]
   }
  ],

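Note on the X→Y→Z versus X←Y←Z remark above: a small simulation can make the point concrete. The sketch below is pure Python with hand-picked, hypothetical coefficients and does not use DoWhy. The reverse chain is parameterised so that its joint Gaussian distribution has the same covariances as the forward chain, yet the two models disagree once Y is set by intervention.

```python
import random
from statistics import mean


def forward_chain(rng, do_y=None):
    """X -> Y -> Z with unit-variance Gaussian noise (made-up coefficients)."""
    x = rng.gauss(0.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, 1.0) if do_y is None else do_y
    z = 3.0 * y + rng.gauss(0.0, 1.0)
    return x, y, z


def reverse_chain(rng, do_y=None):
    """X <- Y <- Z, parameterised to reproduce the forward chain's covariances."""
    z = rng.gauss(0.0, 46.0 ** 0.5)
    y = (15.0 / 46.0) * z + rng.gauss(0.0, (5.0 / 46.0) ** 0.5) if do_y is None else do_y
    x = 0.4 * y + rng.gauss(0.0, 0.2 ** 0.5)
    return x, y, z


def covariance(a, b):
    """Population covariance of two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    return mean((p - mean_a) * (q - mean_b) for p, q in zip(a, b))


rng = random.Random(0)
n = 50000

# Observational regime: both models should show the same dependence of X and Z.
fwd = [forward_chain(rng) for _ in range(n)]
rev = [reverse_chain(rng) for _ in range(n)]
cov_xz_forward = covariance([s[0] for s in fwd], [s[2] for s in fwd])
cov_xz_reverse = covariance([s[0] for s in rev], [s[2] for s in rev])

# Interventional regime: set Y to 1 and look at the mean of Z.
z_mean_forward = mean(forward_chain(rng, do_y=1.0)[2] for _ in range(n))
z_mean_reverse = mean(reverse_chain(rng, do_y=1.0)[2] for _ in range(n))

print(f"Cov(X, Z): forward {cov_xz_forward:.2f}, reverse {cov_xz_reverse:.2f}")
print(f"E[Z | do(Y=1)]: forward {z_mean_forward:.2f}, reverse {z_mean_reverse:.2f}")
```

Both covariances come out close to 6, so observational data cannot tell the graphs apart. Under the intervention, only the forward chain shifts Z, because in the reverse chain Z is a cause of Y rather than an effect.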
diff --git a/docs/source/example_notebooks/gcm_icc.ipynb b/docs/source/example_notebooks/gcm_icc.ipynb
@@ -37,7 +37,7 @@
    "id": "098c7ca0-9391-4033-b056-d09ac5365abf",
    "metadata": {},
    "source": [
-    "In the first example, we use the famous [MPG data set](https://archive.ics.uci.edu/dataset/9/auto+mpg), which contains different features that are used for the prediction of miles per gallon (mpg) of a car engine. The relationship between these features can be modeled as a graphical causal model. For this, we follow the causal graph defined in the [work by Wang et al.](https://ieeexplore.ieee.org/document/8585647) and remove all nodes that have no influence on MPG. This leaves us with the following graph:"
+    "In the first example, we use the famous [MPG data set](https://archive.ics.uci.edu/dataset/9/auto+mpg), which contains different features that are used for the prediction of miles per gallon (mpg) of a car engine. Let's say our task is to improve the design process where we need a good understanding of the influences of our variables on the mpg consumption. The relationship between these features can be modeled as a graphical causal model. For this, we follow the causal graph defined in the [work by Wang et al.](https://ieeexplore.ieee.org/document/8585647) and remove all nodes that have no influence on MPG. This leaves us with the following graph:"
    ]
   },
   {
@@ -120,12 +120,39 @@
     "print(gcm.evaluate_causal_model(scm_mpg, auto_mpg_data, evaluate_invertibility_assumptions=False, evaluate_causal_structure=False))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8b95b48a-a6ac-4d7c-84a6-718a73841aeb",
+   "metadata": {},
+   "source": [
+    "After defining our structural causal model, we can now obtain more insights into what factors influence fuel consumption. As a first insight, we can estimate the direct arrow strength of the connections weight -> mpg and horsepower -> mpg. Note that by default, the arrow strength method measures the influence with respect to the variance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6257a9e2-a1f4-4c16-b629-b5a13bbf32dc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "arrow_strengths_mpg = gcm.arrow_strength(scm_mpg, target_node='mpg')\n",
+    "gcm.util.plot(scm_mpg.graph, causal_strengths=arrow_strengths_mpg)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42e62ed2-7e29-49ac-99cd-bca20d7565ca",
+   "metadata": {},
+   "source": [
+    "As we see here, the weight has a much higher influence on the variance in mpg than horsepower does."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "5258a976-dce8-4268-9fb9-2a03c03325a5",
    "metadata": {},
    "source": [
-    "After defining our structural causal model, we can now apply the ICC method to obtain more insights into what factors influence fuel consumption. This could help us improve the design process. Note that by default, we attribute the variance of the target node to the upstream nodes."
+    "While knowing how much the direct parents influence our node of interest provides some valuable insights, the weight and horsepower might only just inherit information from their common parent. To distinguish between the information inherited from the parent and their own contribution, we apply the ICC method:"
    ]
   },
   {
@@ -173,7 +200,7 @@
    "id": "1606ebc4-7223-48c0-b92d-5fb13761d81b",
    "metadata": {},
    "source": [
-    "It turns out that the number of cylinders already explains a large fraction of the fuel consumption and the intermediate nodes like displacement, horsepower, and weight mostly inherit uncertainty from their parents. This is because, although weight and horsepower are the more direct predictors of mpg, they are mostly determined by displacement and cylinders. As we also see with the contribution of mpg itself, roughly 1/4 of the variance of mpg remains unexplained by all of the above factors, which may be partially due to model inaccuracies.\n",
+    "It turns out that the number of cylinders already explains a large fraction of the fuel consumption and the intermediate nodes like displacement, horsepower, and weight mostly inherit uncertainty from their parents. This is because, although weight and horsepower are the more direct predictors of mpg, they are mostly determined by displacement and cylinders. This gives some useful insights for potential optimizations. As we also see with the contribution of mpg itself, roughly 1/4 of the variance of mpg remains unexplained by all of the above factors, which may be partially due to model inaccuracies.\n",
     "\n",
     "While the model evaluation showed that there are some inaccuracies with respect to the KL divergence between the generated and observed distributions, we see that ICC still provides non-trivial results in the sense that the contributions differ significantly across nodes and that not everything is simply attributed to the target node itself."
    ]
@@ -183,9 +210,7 @@
    "id": "043ab28d-6732-4fa8-8c59-8a592f2058be",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "Note that estimating the contribution to the variance of the target in ICC can be seen as a nonlinear version of ANOVA that incorporates the causal structure.\n",
-    "</div>"
+    "> Note that estimating the contribution to the variance of the target in ICC can be seen as a nonlinear version of ANOVA that incorporates the causal structure."
    ]
   },
   {
@@ -201,18 +226,15 @@
    "id": "5fa50502-152a-4d77-985c-b4ae51426e96",
    "metadata": {},
    "source": [
-    "In the next example, we look at different recordings taken of the river flows ($m^3/s$) at a 15 minute frequency across 5 different measuring stations in England at Henthorn, New Jumbles Rock, Hodder Place, Whalley Weir and Samlesbury. The data is taken from the [UK Department for Environment Food & Rural Affairs website](https://environment.data.gov.uk/hydrology/explore). Here is a map of the rivers:"
+    "In the next example, we look at different recordings taken of the river flows ($m^3/s$) at a 15 minute frequency across 5 different measuring stations in England at Henthorn, New Jumbles Rock, Hodder Place, Whalley Weir and Samlesbury. Here, obtaining a better understanding of how the river flows behave can help to plan potential mitigation steps to avoid overflows. The data is taken from the [UK Department for Environment Food & Rural Affairs website](https://environment.data.gov.uk/hydrology/explore). Here is a map of the rivers:"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c773d712-0806-4fc9-9739-62f2749406e2",
+   "cell_type": "markdown",
+   "id": "68bfc4ad-94e9-43ff-84bb-5fa1c82a06c4",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "from IPython.display import Image\n",
-    "Image('river-map.jpg') "
+    "<img src=\"./river-map.jpg\" width=\"600\"/>"
    ]
   },
   {
@@ -243,25 +265,23 @@
    "id": "38ea910f-b7f3-4923-9fc9-b40bb3a7f2b6",
    "metadata": {},
    "source": [
-    "Here, we are interested in the causal influence of the upstream rivers on the Samlesbury river. For instance, to obtain a better understanding of how the river flows behave and to potentially plan mitigation steps to avoid overflows. Similar to the example before, we would expect these nodes to be heavily confounded by, e.g., the weather. That is, the true graph is more likely to be along the lines of:"
+    "In this setting, we are interested in the causal influence of the upstream rivers on the Samlesbury river. Similar to the example before, we would expect these nodes to be heavily confounded by, e.g., the weather. That is, the true graph is more likely to be along the lines of:"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "88cd1c6e-d8c7-48e6-bc37-7b5280f39d57",
+   "cell_type": "markdown",
+   "id": "c0a0441a-929f-40f7-a4ed-5601846959da",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "Image('river-confounded.png') "
+    "<img src=\"./river-confounded.png\" width=\"400\"/>"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "b94dea14-9058-406d-8466-eb066815c095",
    "metadata": {},
    "source": [
-    "Nevertheless, we still expect the ICC algorithm to provide some insights into the contribution to the river flow of Samlesbury, even with the hidden confounder in place."
+    "Nevertheless, we still expect the ICC algorithm to provide some insights into the contribution to the river flow of Samlesbury, even with the hidden confounder in place:"
    ]
   },
   {

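Note on the ICC notebook above: the idea of attributing the variance of a target to each node's own noise, including the target's, is easiest to see in the special case of a linear model with independent noise terms. There the target variance splits additively into one share per node. The sketch below uses a hypothetical three-node chain with invented coefficients; it is not the MPG graph and not DoWhy's estimator, which also handles nonlinear mechanisms.

```python
import random
from statistics import pvariance

rng = random.Random(0)
n = 50000

# Hypothetical linear chain: cylinders -> weight -> mpg (invented coefficients).
noise_cylinders = [rng.gauss(0.0, 2.0) for _ in range(n)]
noise_weight = [rng.gauss(0.0, 1.0) for _ in range(n)]
noise_mpg = [rng.gauss(0.0, 1.5) for _ in range(n)]

cylinders = noise_cylinders
weight = [1.5 * c + e for c, e in zip(cylinders, noise_weight)]
mpg = [-2.0 * w + e for w, e in zip(weight, noise_mpg)]

# Unrolling the chain gives
#   mpg = -3.0 * noise_cylinders - 2.0 * noise_weight + noise_mpg,
# so each node's share is (total path coefficient)^2 * (noise variance).
shares = {
    "cylinders": (-3.0) ** 2 * 2.0 ** 2,
    "weight": (-2.0) ** 2 * 1.0 ** 2,
    "mpg": 1.0 ** 2 * 1.5 ** 2,
}
total_share = sum(shares.values())
simulated_variance = pvariance(mpg)

for node, share in shares.items():
    print(f"{node}: {share / total_share:.1%} of the variance of mpg")
print(f"sum of shares: {total_share:.2f}, simulated variance: {simulated_variance:.2f}")
```

In this toy chain, weight is the only direct parent of mpg, but most of the variance traces back to the noise of cylinders. The share assigned to mpg itself is the part that no upstream node explains.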
diff --git a/docs/source/example_notebooks/gcm_online_shop.ipynb b/docs/source/example_notebooks/gcm_online_shop.ipynb
@@ -5,7 +5,7 @@
    "id": "1c74ae7a-e77a-4b38-be41-7fb82e6930a5",
    "metadata": {},
    "source": [
-    "# Causal attribution and root-cause analysis of an online shop"
+    "# Causal Attributions and Root-Cause Analysis in an Online Shop"
    ]
   },
   {
@@ -274,25 +274,23 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "print(gcm.evaluate_causal_model(scm, data_2021, compare_mechanism_baselines=True, evaluate_invertibility_assumptions=False, evaluate_causal_structure=False))"
+    "print(gcm.evaluate_causal_model(scm, data_2021, compare_mechanism_baselines=True, evaluate_invertibility_assumptions=False))"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "b65c73b6-28af-4b57-b05f-c6c6245a714a",
    "metadata": {},
    "source": [
-    "The fitted causal mechanisms are fairly good representations of the data generation process, with some minor inaccuracies. However, this is to be expected given the small sample size and relatively small signal-to-noise ratio for many nodes. Most importantly, all the baseline mechanisms did not perform better, which is a good indicator that our model selection is appropriate. Note that even if the models have some inaccuracies, most of the GCM-based algorithms are generally robust against this."
+    "The fitted causal mechanisms are fairly good representations of the data generation process, with some minor inaccuracies. However, this is to be expected given the small sample size and relatively small signal-to-noise ratio for many nodes. Most importantly, all the baseline mechanisms did not perform better, which is a good indicator that our model selection is appropriate. Based on the evaluation, we also do not reject the given causal graph."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "a2e5acb0-f40e-4478-bee8-db3ff58bea21",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "The selection of baseline models change be configured as well. For more details, take a look at the corresponding evaluate_causal_model documentation.\n",
-    "</div>"
+    "> The selection of baseline models or the p-value for graph falsification can be configured as well. For more details, take a look at the corresponding evaluate_causal_model documentation."
    ]
   },
   {
@@ -301,6 +299,34 @@
    "metadata": {},
    "source": [
     "## Step 3: Answer causal questions\n",
+    "### Generate new samples\n",
+    "\n",
+    "Since we learned about the data generation process, we can also generate new samples:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "74ed17eb-45ec-444d-9364-2aa850911a05",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gcm.draw_samples(scm, num_samples=10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8dfc67f4-f89c-480d-8f6a-b0a74651ef3f",
+   "metadata": {},
+   "source": [
+    "We have drawn 10 samples from the joint distribution following the learned causal relationships."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a10e5ff-c5aa-4386-b462-bd696a1a54ed",
+   "metadata": {},
+   "source": [
     "### What are the key factors influencing the variance in profit?"
    ]
   },
@@ -553,9 +579,7 @@
    "id": "c2c82783-9a69-43c3-820c-048a6f5860bf",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-info\">\n",
-    "Note that the results differ depending on the selected data, since they are sample specific. On other days, other factors could be relevant. Furthermore, note that the analysis (including the confidence intervals) always relies on the modeling assumptions made. In other words, if the models change or have a poor fit, one would also expect different results.\n",
-    "</div>"
+    "> Note that the results differ depending on the selected data, since they are sample specific. On other days, other factors could be relevant. Furthermore, note that the analysis (including the confidence intervals) always relies on the modeling assumptions made. In other words, if the models change or have a poor fit, one would also expect different results."
    ]
   },
   {