When players submit an explanation on Neuronpedia, we score it to evaluate how good the explanation is relative to the activations. The score ranges from -1 to 1. This is the first step to fully open-sourcing Neuronpedia. 🎉
Goals / Desired Outcomes
- Improve the existing scorer (see "Contributing" for ideas), or fix bugs and issues with it.
- Build a new scorer that is better than the one we're using now, so we can add it to Neuronpedia and throw a huge party to celebrate.
- Make other improvements (enhancements, test cases, better docs, etc.)
This is a Flask server with only one endpoint that is called by the Neuronpedia server: POST /score
The code mostly plugs in OpenAI's Automated Interpretability Neuron Explainer demo. For each explanation, it asks GPT to guess ("simulate") what the activations will be for each of the 20 texts. Then it computes the correlation between the actual activations and the simulated ones. That correlation is the score.
It's not perfect, but it works well a lot of the time. For details, including the exact prompts used, read their paper or code.
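To make the "correlation is the score" idea concrete, here is a minimal sketch of a Pearson correlation over all tokens of all texts. It is not the Neuron Explainer's actual code: the function name `correlation_score`, the flattening of all texts into one sequence, and the choice to return 0.0 when either side is constant are assumptions made for illustration.

```python
import math


def correlation_score(actual, simulated):
    """Pearson correlation between real and simulated activations.

    `actual` and `simulated` are lists of per-text value lists.
    Hypothetical helper: flattening all texts into one sequence is an
    assumption, not necessarily what the real scorer does.
    """
    real = [v for text in actual for v in text]
    sim = [v for text in simulated for v in text]
    if not real or len(real) != len(sim):
        raise ValueError("actual and simulated must be non-empty and equal in length")

    mean_real = sum(real) / len(real)
    mean_sim = sum(sim) / len(sim)
    covariance = sum((r - mean_real) * (s - mean_sim) for r, s in zip(real, sim))
    var_real = sum((r - mean_real) ** 2 for r in real)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)

    # Correlation is undefined when either side never varies.
    # Returning 0.0 here is an assumption for this sketch.
    if var_real == 0 or var_sim == 0:
        return 0.0
    return covariance / math.sqrt(var_real * var_sim)
```

A simulation that tracks the real activations up to a positive scale factor scores close to 1, and one that is inverted scores close to -1, which matches the -1 to 1 range of the score.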
With the standard 20 activation texts of 64 tokens each and the default gpt-4 simulator, scoring one explanation costs a whopping 30 cents (USD). So be careful, and make sure to set both a hard and a soft limit on your OpenAI account. You could also start by sending the scorer fewer activation texts.
If you end up doing serious batches of work, it may be worth it to apply for Research Credits from OpenAI.
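For budgeting a batch, a rough back-of-the-envelope estimate can help. The sketch below assumes the cost scales linearly with the total number of tokens sent, anchored to the 30 cents for 20 texts of 64 tokens quoted above. That linearity is an assumption: fixed prompt overhead is ignored, so small requests will likely cost more than this predicts. The name `estimate_cost_usd` is hypothetical.

```python
# Baseline quoted above: 20 texts x 64 tokens costs about 0.30 USD per explanation.
BASELINE_COST_USD = 0.30
BASELINE_TEXTS = 20
BASELINE_TOKENS_PER_TEXT = 64


def estimate_cost_usd(num_texts, tokens_per_text=64, num_explanations=1):
    """Rough cost estimate, assuming cost is linear in total tokens (an assumption)."""
    baseline_tokens = BASELINE_TEXTS * BASELINE_TOKENS_PER_TEXT
    tokens = num_texts * tokens_per_text
    return BASELINE_COST_USD * tokens / baseline_tokens * num_explanations
```

Under this assumption, scoring 100 explanations with the standard 20 texts comes to roughly 30 USD, while cutting to 5 texts per explanation brings it to roughly 7.50 USD.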
Clone the repository:
git clone https://github.com/hijohnnylin/neuronpedia-scorer
In the new directory, create a .env file that contains:
OPENAI_API_KEY="[insert your OpenAI key here]"
SERVER_KEY="[empty string if you aren't hosting it publicly]"
Then, run the following to install requirements and start the local Flask server at port 5000:
pip install -r requirements.txt
python server.py
> loading
> * Serving Flask app 'server'
> * Debug mode: off
> WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
> * Running on http://127.0.0.1:5000
> Press CTRL+C to quit
Finally, to test that it's working, you can run this curl command, which asks the scorer to score two activation texts: "the quick brown fox" and "spotted leopard sprints at" using the explanation "fast animals".
WARNING: THIS WILL COST A FEW CENTS
curl -XPOST -H "Content-type: application/json" -d '{ "explanation": "fast animals", "secret": "", "activations": [ { "tokens": ["the ", " quick", " brown", " fox"], "values": [0, 2, 0, 2.5] }, { "tokens": ["spotted ", " leopard", " sprints", " at"], "values": [0.5, 2.5, 1.5, 0] } ] }' 'localhost:5000/score'
You should get a response like this, which shows a score of 0.8896 (max is 1), indicating this was a pretty good explanation:
{
"score": 0.8896365998071626,
"simulations": [array of ScoredSequenceSimulation, see below for spec]
}
POST /score
Request
{
explanation: "fast animals",
secret: SERVER_KEY,
activations: [
{
tokens: ["the ", " quick", " brown", " fox"],
values: [0, 2, 0, 2.5]
},
{
tokens: ["spotted ", " leopard", " sprints", " at"],
values: [0.5, 2.5, 1.5, 0]
}
]
}
Response
{
score: the score, a value from -1 to 1,
simulations: an array of ScoredSequenceSimulation, which is basically just the simulated activation values that GPT returns
}
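The same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_score_payload` and `post_score` are hypothetical, and the check that each activation has as many values as tokens is an assumption about what the server expects, not something the spec above states.

```python
import json
import urllib.request


def build_score_payload(explanation, activations, secret=""):
    """Assemble the JSON body for POST /score (hypothetical helper)."""
    for index, activation in enumerate(activations):
        # Assumption: every token needs exactly one activation value.
        if len(activation["tokens"]) != len(activation["values"]):
            raise ValueError(f"activation {index}: tokens and values differ in length")
    return {"explanation": explanation, "secret": secret, "activations": activations}


def post_score(payload, url="http://localhost:5000/score"):
    """Send the payload to the scorer. WARNING: each call costs real money."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (not run automatically, since it calls the OpenAI-backed server):
# payload = build_score_payload(
#     "fast animals",
#     [{"tokens": ["the ", " quick", " brown", " fox"], "values": [0, 2, 0, 2.5]}],
# )
# print(post_score(payload)["score"])
```

Building the payload separately from sending it makes it possible to validate or log a request without incurring any cost.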
Here is the format of ScoredSequenceSimulation.
I'd appreciate any contributions you'd like to make, whether that's bug reports, adding new features, etc.
In general, this code is fairly specific in that it's basically a thin wrapper around OpenAI's Neuron Explainer. The most impactful thing is to make the scorer itself better or to come up with a better scorer: for a set of activation texts, a higher score should go to an explanation that seems better, and a lower score to one that seems worse. Currently, this is not always the case.
Here are a few examples of where the scorer could be improved, mostly around incorporating the context. In these examples, the highest scored explanations are good, but they should probably have a lower score than other explanations that explain the full context around the non-highest-activating tokens.
- GPT2-SMALL@6:2294 - top score of 33 is "the words 'chapter' and 'on'". However, another explanation, "words and phrases describing the structure of written media, especially chapters and themes", has a lower score yet explains the full context better.
- GPT2-SMALL@6:281 - top score is GPT4's "female pronouns and related phrases", because the token "she" is activated, but it's actually only activated as part of the word "shelled", which has nothing to do with female pronouns. The human explanation by user turnippls is far better.
- GPT2-SMALL@6:2381 - top score is "the word 'time'", but the activations are about spending a portion of time, and the other explanation should probably score higher.
There are at least a few things that can be tried, tested, and benchmarked to improve scoring accuracy - the following are highly promising but I haven't had time to try.
- Select Different Activation Text Samples - We currently select the top 20 activating texts, each with 64 tokens, to give to the simulator - aka the "top k" approach. We should try variations of this - top 10 plus 10 random ones, or top 20 but prefer activations with different highest activating tokens, etc.
- Modify the GPT Simulation Prompts - The GPT simulation prompts are pretty interesting. But like every piece of software, there can probably be improvements. Maybe you can specifically get the prompt to weigh context as more important than just the top activating token. You can see the prompts used in their paper under 'Step 2: Simulation'. As a bonus, this might result in cost savings if the prompts you come up with are shorter, or if they can be used with GPT-3.5-Turbo (without logprobs), instead of the expensive "text-davinci-003" model used now.
- Add a "Post-Scoring Adjustment" - Maybe after the score/correlation is calculated with the Neuron Explainer scorer, you can then run another prompt or algorithm that adjusts the score based on something you want to prioritize more, like context.
- Create an Entirely New Scorer - Make up your own scorer! We'll plug it into Neuronpedia and have users play with it. You will of course be credited. Some ideas - wanna train your own model? Do a regex scorer? Regex + GPT4 magic sauce? Got some clever dictionary approach?
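As a starting point for the first idea in the list above, here is a minimal sketch of the "top 10 plus 10 random" variation. It assumes each activation text is a dict with non-empty `tokens` and `values` lists, as in the POST /score request, and ranks texts by their maximum activation value. The function name `select_top_plus_random` and its parameters are hypothetical, and this is not how Neuronpedia currently selects texts.

```python
import random


def select_top_plus_random(activations, num_top=10, num_random=10, seed=None):
    """Pick the highest-activating texts plus a random sample of the rest.

    Hypothetical helper: texts are ranked by their maximum activation value,
    and each text is assumed to have a non-empty "values" list.
    """
    ranked = sorted(activations, key=lambda a: max(a["values"]), reverse=True)
    top = ranked[:num_top]
    rest = ranked[num_top:]

    # A seeded generator keeps benchmark runs reproducible.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(num_random, len(rest)))
    return top + sampled
```

Passing a fixed `seed` gives the same selection on every run, which makes it easier to compare scores for the same explanation across scorer variants.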
OpenAI Automated Interpretability
MIT