Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed ones
Steering vectors for transformer language models in PyTorch / Hugging Face
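The steering-vector idea above can be sketched in plain PyTorch. This is a minimal illustration on a hypothetical toy layer, not the listed library's actual API: a fixed direction is added to a layer's output during the forward pass via a forward hook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer layer whose output we want to steer.
layer = nn.Linear(8, 8)

# A steering vector; in practice it is typically derived from contrastive
# activations (e.g. the mean activation difference between two prompt sets).
steering_vector = torch.randn(8)

def steer(module, inputs, output):
    # Shift the layer's output along the steering direction, scaled by 2.0.
    return output + 2.0 * steering_vector

handle = layer.register_forward_hook(steer)
x = torch.zeros(1, 8)
steered = layer(x)
handle.remove()
unsteered = layer(x)

# The hook shifts the output by exactly the scaled steering vector.
print(torch.allclose(steered - unsteered, 2.0 * steering_vector))
```

Real steering-vector libraries wrap this pattern with utilities for extracting directions from paired prompts and applying them at chosen layers and token positions.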
Sparse and discrete interpretability tool for neural networks
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
Multi-Layer Sparse Autoencoders (ICLR 2025)
graphpatch is a library for activation patching on PyTorch neural network models.
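Activation patching, the technique graphpatch implements, can be sketched with forward hooks on a hypothetical toy model (this is not graphpatch's actual API): cache a layer's activation from a "clean" run, then substitute it into a run on a "corrupted" input to measure how much that layer accounts for the behavioral difference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

clean = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
corrupt = torch.tensor([[0.0, 1.0, 0.0, 0.0]])

cache = {}

def save_hook(module, inputs, output):
    cache["act"] = output.detach()

def patch_hook(module, inputs, output):
    return cache["act"]

# 1) Clean run: cache the first layer's activation.
h = model[0].register_forward_hook(save_hook)
clean_out = model(clean)
h.remove()

# 2) Patched run: corrupted input, but with layer 0's clean activation
#    substituted in via the hook.
h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt)
h.remove()

# With layer 0 fully patched, every downstream layer sees the clean
# activation, so the patched output matches the clean output.
print(torch.allclose(patched_out, clean_out))
```

In interpretability practice the patch is usually applied to a narrower slice (one head, one token position) so that partial recovery of the clean behavior localizes the relevant component.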
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
[ACL 2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
A small package implementing some useful wrapping around nnsight
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.
A framework for conducting interpretability research and for developing an LLM from a synthetic dataset.
Starting Kit for the CodaBench competition on Transformer Interpretability