Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Updated Jan 3, 2025 - Python
This repository collects all relevant resources about interpretability in LLMs
Decomposing and Editing Predictions by Modeling Model Computation
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Steering vectors for transformer language models in PyTorch / Hugging Face
Interpreting how transformers simulate agents performing RL tasks
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
🧠 Starter templates for doing interpretability research
Sparse and discrete interpretability tool for neural networks
Full code for the sparse probing paper.
Generating and validating natural-language explanations.
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages." Paper accepted at Findings of EMNLP 2024
Multi-Layer Sparse Autoencoders (ICLR 2025)
CoSy: Evaluating Textual Explanations