Write some docs
TheodoreEhrenborg committed Sep 29, 2024
1 parent c457e7a commit 7a73ee7
Showing 2 changed files with 28 additions and 2 deletions.
24 changes: 22 additions & 2 deletions docs/src/introduction.md
@@ -4,10 +4,30 @@ This is an imitation of Anthropic's
[Scaling monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
work, except using a far smaller LLM.

Skip to [Steering](./steering.md) to see the final results.

The code is [here](https://github.com/TheodoreEhrenborg/tiny_stories_sae).

Terminology:
- activation
- feature---specifically the SAE's

## Demo

I trained a sparse autoencoder with 10000 features, and
ChatGPT says feature 87 relates to

> the concepts of upward motion or reaching a peak state such as 'up', 'fly', 'high', and similar directions or motions

Let's ask the TinyStories model to complete a story starting with
"There once was a cat". I'll nudge each activation after layer 2 in the direction
TODO

And indeed, the cat flies:

> There once was a cat named Mittens. Mittens was very fast and loved to play with his friends. One day, Mittens saw a bird fly in the sky like an orange taxi.
> Mittens said to his friend, "I wish I could fly like a bird." The bird replied, "That's easy, just go slow and never be foolish sometimes."
> Mittens practiced every day, getting better and better. One day, they passed each other and Timmy said, "Wow, you're so good at flying!" Mittens' friends laughed and said, "We are like angels!"
> From that day on, Mittens was never lazy again and always remembered to practice and race with other kids. He felt happy and free while flying high in the sky.
6 changes: 6 additions & 0 deletions docs/src/making_it_sparse.md
@@ -2,6 +2,12 @@

Be careful when computing the magnitudes for the penalty: if the other dimension isn't summed over first, the intermediate tensor ends up very large. The dimensions involved are:

- seq_len
- 768
- number of features (e.g. 10000)
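
The warning above can be illustrated with a small sketch. It assumes the penalty weights each feature activation by the L2 norm of that feature's decoder column, as in the Anthropic work cited in the introduction; the notes here do not spell out the formula, so treat that as an assumption. The function names, the argument layout, and the use of NumPy are all hypothetical and are not taken from the repository.

```python
import numpy as np


def penalty_sum_first(feature_acts, decoder):
    """Hypothetical sparsity penalty: sum of |activation| * decoder column norm.

    feature_acts: shape (seq_len, n_features)
    decoder:      shape (d_model, n_features), e.g. d_model = 768
    """
    # Sum over the d_model dimension first, leaving one norm per feature.
    decoder_norms = np.sqrt((decoder ** 2).sum(axis=0))  # (n_features,)
    # (seq_len, n_features) * (n_features,) broadcasts with no 3-D intermediate.
    return (np.abs(feature_acts) * decoder_norms).sum()


def penalty_naive(feature_acts, decoder):
    """Same value, but builds a (seq_len, d_model, n_features) tensor first."""
    # Scale every decoder column by every activation before reducing.
    scaled = feature_acts[:, None, :] * decoder[None, :, :]
    # Only now sum over d_model (axis 1), after the large tensor exists.
    return np.sqrt((scaled ** 2).sum(axis=1)).sum()
```

Both functions return the same number, but the naive version materializes every (position, dimension, feature) product. With 768 and 10000 features from the notes above and an assumed seq_len of 512, that intermediate would hold about 3.9 billion elements.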




Proportion of nonzero features
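
One way this proportion might be computed, as a minimal sketch: it assumes "features" means the SAE's feature activations and that "nonzero" means not exactly zero. The function name and the use of NumPy are hypothetical.

```python
import numpy as np


def proportion_nonzero(feature_acts):
    """Fraction of entries in feature_acts that are not exactly zero."""
    return np.count_nonzero(feature_acts) / feature_acts.size
```

For example, an activation array with 2 nonzero entries out of 8 gives a proportion of 0.25.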
