
Commit

readme added
hmd101 committed Feb 18, 2025
1 parent e594c1a commit cb40738
Showing 5 changed files with 142 additions and 666 deletions.
102 changes: 98 additions & 4 deletions analysis/causalalign/README.md
@@ -1,10 +1,104 @@
# Code for the Project: Causal Reasoning Capabilities in LLMs
# CausalAlign -- Evaluating Causal Alignment Between Humans and LLMs

## **Overview**

## Development Installation
You can install this Python package in an editable fashion via `pip`:
**CausalAlign** is a Python package designed to evaluate the causal reasoning abilities of **Large Language Models (LLMs)** in comparison to **human inference**. It provides tools to systematically assess how LLMs reason about causal structures — particularly **collider graphs** — and whether their judgments align with normative Bayesian inference or human reasoning biases.

This package is based on the work presented in the paper:

> "Do Large Language Models Reason Causally Like Us? Even Better?"
>
>
> *Hanna M. Dettki, Brenden M. Lake, Charley M. Wu, Bob Rehder*
>
> [[arXiv:2502.10215]](https://arxiv.org/pdf/2502.10215)
>
## **Key Features**

- Implements the **collider causal inference** framework to compare LLM and human reasoning.
- Provides tools for **(Spearman) correlation analysis** to measure alignment between human and LLM judgments (see the sketch below this list).
- Fits responses to **Causal Bayesian Networks (CBNs)** to evaluate normative reasoning.
- Supports API calls for multiple LLMs, including **GPT-4o, GPT-3.5, Claude-3-Opus, and Gemini-Pro**.
- Built around a dataset constructed from Experiment 1 (model-only condition) of Rehder & Waldmann (2017).
- Dataset: tbd
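
As a minimal sketch of the correlation analysis mentioned in the feature list, the snippet below computes a Spearman correlation with `scipy.stats.spearmanr`. The rating values are invented placeholder numbers, not data from the paper, and the snippet calls SciPy directly rather than the package's own API:

```python
# Illustrative only: Spearman correlation between human and LLM causal judgments.
# The rating values below are made up; they are not data from the paper.
import numpy as np
from scipy.stats import spearmanr

# Mean likelihood ratings (0-100) for the same set of collider inference queries.
human_ratings = np.array([72, 55, 64, 81, 38, 47, 90, 60])
llm_ratings   = np.array([78, 50, 70, 85, 45, 52, 88, 66])

rho, p_value = spearmanr(human_ratings, llm_ratings)
print(f"Spearman r_s = {rho:.3f} (p = {p_value:.3f})")
```

In the reported analyses, the same statistic is computed per domain (economy, sociology, weather) and pooled, as in the table further below.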


## **Installation**

To install the package, clone the repository and install it in editable mode:

```bash
git clone https://github.com/hmd101/causalalign.git
cd causalalign/analysis/causalalign
pip install -e .
```

If you want to install directly from a public GitHub repository:

```bash
pip install git+https://github.com/hmd101/causalalign.git
```

## **Usage**

```bash
tbd
```


## **Data and Models**

CausalAlign provides support for evaluating LLMs across different causal inference tasks. It is particularly well suited to datasets built on the experiments of Rehder & Waldmann (2017).



It includes built-in support for:

- **Human data from Rehder & Waldmann (2017), experiment 1, model-only condition**
- **LLM inference responses (GPT-4o, GPT-3.5, Claude, Gemini)**

## **Some Benchmarking Results**


### **Paper Summary: "Do Large Language Models Reason Causally Like Us? Even Better?"**

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but do they **reason causally like humans**? This study investigates whether LLMs exhibit **human-like causal reasoning**, whether they follow **normative Bayesian inference**, or whether they introduce **unique biases**.

The study compares human judgments with four LLMs (**GPT-4o, GPT-3.5, Claude-3-Opus, and Gemini-Pro**) on **causal inference tasks** based on **collider structures** (e.g., `C1 → E ← C2`). The results show:

- **Claude-3-Opus and GPT-4o** were the most **normative**, showing strong **explaining away** behavior (illustrated in the sketch after this list).
- **GPT-3.5 and Gemini-Pro** exhibited more **associative biases**, resembling human errors.
- LLMs often **overestimated causal strengths** compared to human participants.
- LLMs relied more on **domain knowledge** than humans, sometimes leading to **non-normative responses**.
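
To make the explaining-away pattern concrete, here is a small sketch that enumerates a collider `C1 → E ← C2`. The priors and noisy-OR causal strengths are assumptions chosen for this example, not the parameter values used in the paper or fitted to participants:

```python
# Minimal illustration of "explaining away" in a collider C1 -> E <- C2.
# Parameters below (priors, noisy-OR strengths) are illustrative only,
# not the values used in the paper.
from itertools import product

p_c = 0.5                  # prior P(C1=1) = P(C2=1)
b, m1, m2 = 0.1, 0.8, 0.8  # background cause and causal strengths (leaky noisy-OR)

def p_e_given(c1, c2):
    """P(E=1 | C1=c1, C2=c2) under a leaky noisy-OR parameterization."""
    return 1 - (1 - b) * (1 - m1) ** c1 * (1 - m2) ** c2

def joint(c1, c2, e):
    """Joint probability P(C1=c1, C2=c2, E=e)."""
    p_e = p_e_given(c1, c2)
    return (p_c if c1 else 1 - p_c) * (p_c if c2 else 1 - p_c) * (p_e if e else 1 - p_e)

def conditional(query, evidence):
    """P(query | evidence) by brute-force enumeration over C1, C2, E."""
    num = den = 0.0
    for c1, c2, e in product([0, 1], repeat=3):
        state = {"C1": c1, "C2": c2, "E": e}
        if all(state[k] == v for k, v in evidence.items()):
            p = joint(c1, c2, e)
            den += p
            if all(state[k] == v for k, v in query.items()):
                num += p
    return num / den

print(f"P(C1=1 | E=1)       = {conditional({'C1': 1}, {'E': 1}):.3f}")            # ~0.66
print(f"P(C1=1 | E=1, C2=1) = {conditional({'C1': 1}, {'E': 1, 'C2': 1}):.3f}")   # ~0.54
```

The drop from the first to the second probability is the normative explaining-away effect that, per the results above, Claude-3-Opus and GPT-4o reproduced most strongly.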

This work underscores the need to evaluate LLMs’ reasoning **beyond pattern recognition**, ensuring they make **reliable causal inferences** in **real-world applications**.

For more details, refer to the full paper: [[arXiv:2502.10215]](https://arxiv.org/pdf/2502.10215).

Table: Spearman correlations (**rₛ**) between **human** and **LLM** inferences.

| Model | Economy (rₛ) | Sociology (rₛ) | Weather (rₛ) | Pooled (rₛ) |
| --- | --- | --- | --- | --- |
| **Claude** | 0.557 | 0.637 | 0.698 | **0.631** |
| **GPT-4o** | 0.662 | 0.572 | 0.645 | **0.626** |
| **GPT-3.5** | 0.419 | 0.450 | 0.518 | 0.462 |
| **Gemini** | 0.393 | 0.297 | 0.427 | 0.372 |

## **Citation**

Please cite the paper as:

```
@misc{dettki2025largelanguagemodelsreason,
  title         = {Do Large Language Models Reason Causally Like Us? Even Better?},
  author        = {Hanna M. Dettki and Brenden M. Lake and Charley M. Wu and Bob Rehder},
  year          = {2025},
  eprint        = {2502.10215},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2502.10215}
}
```
2 changes: 1 addition & 1 deletion analysis/causalalign/pyproject.toml
@@ -24,7 +24,7 @@ dynamic = ["version"]
plotting = ["matplotlib>=3.4.3"]

[project.urls]
github = "https://github.com/hmd101/llm-causality"
github = "https://github.com/hmd101/causalalign"

################################################################################
# PEP 518 Build System Configuration #
@@ -1,6 +1,3 @@
# causal_analysis/__init__.py


from .significance_analyzer import CausalReasoningAnalysis

__all__ = ["CausalReasoningAnalysis"]
