Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed ones
Steering vectors for transformer language models in PyTorch / Hugging Face
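The steering-vector idea above can be sketched in plain PyTorch. This is a minimal illustration on a hypothetical toy layer, not the listed library's actual API: a fixed direction is added to a layer's output during the forward pass via a forward hook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer layer whose output we want to steer.
layer = nn.Linear(8, 8)

# A steering vector; in practice it is typically derived from contrastive
# activations (e.g. the mean activation difference between two prompt sets).
steering_vector = torch.randn(8)

def steer(module, inputs, output):
    # Shift the layer's output along the steering direction, scaled by 2.0.
    return output + 2.0 * steering_vector

handle = layer.register_forward_hook(steer)
x = torch.zeros(1, 8)
steered = layer(x)
handle.remove()
unsteered = layer(x)

# The hook shifts the output by exactly the scaled steering vector.
print(torch.allclose(steered - unsteered, 2.0 * steering_vector))
```

Real steering-vector libraries wrap this pattern with utilities for extracting directions from paired prompts and applying them at chosen layers and token positions.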
Sparse and discrete interpretability tool for neural networks
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
Multi-Layer Sparse Autoencoders (ICLR 2025)
graphpatch is a library for activation patching on PyTorch neural network models.
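Activation patching, the technique graphpatch implements, can be sketched with forward hooks on a hypothetical toy model (this is not graphpatch's actual API): cache a layer's activation from a "clean" run, then substitute it into a run on a "corrupted" input to measure how much that layer accounts for the behavioral difference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

clean = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
corrupt = torch.tensor([[0.0, 1.0, 0.0, 0.0]])

cache = {}

def save_hook(module, inputs, output):
    cache["act"] = output.detach()

def patch_hook(module, inputs, output):
    return cache["act"]

# 1) Clean run: cache the first layer's activation.
h = model[0].register_forward_hook(save_hook)
clean_out = model(clean)
h.remove()

# 2) Patched run: corrupted input, but with layer 0's clean activation
#    substituted in via the hook.
h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt)
h.remove()

# With layer 0 fully patched, every downstream layer sees the clean
# activation, so the patched output matches the clean output.
print(torch.allclose(patched_out, clean_out))
```

In interpretability practice the patch is usually applied to a narrower slice (one head, one token position) so that partial recovery of the clean behavior localizes the relevant component.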
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
[ACL 2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
A small package implementing some useful wrapping around nnsight
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.
A framework for conducting interpretability research and for developing an LLM from a synthetic dataset.
Starting Kit for the CodaBench competition on Transformer Interpretability