Evaluation of Llama on HellaSwag #539
You should be able to use HF's implementation of Llama. Also note there is an issue with tokenization, which is currently a work in progress.
I evaluated it recently like so:

python main.py --model hf-causal-experimental --model_args pretrained=huggyllama/llama-7b,use_accelerate=True --tasks hellaswag --batch_size auto
For the implementation, you compute summed log probabilities (loglikelihoods) of each candidate completion and pick the candidate with the highest score. As an example:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-7b', device_map='auto', load_in_8bit=True)
test = [
    ('Paris is the capital of', ' England.'),
    ('Paris is the capital of', ' Germany.'),
    ('Paris is the capital of', ' France.'),
    ('Paris is the capital of', ' Japan.'),
]
# encode the full sentences and the bare questions; no padding for simplicity,
# which assumes equal token lengths across the batch
batched_sentences = tokenizer.batch_encode_plus([q + a for q, a in test], add_special_tokens=False, return_tensors='pt')['input_ids']
batched_questions = tokenizer.batch_encode_plus([q for q, _ in test], add_special_tokens=False, return_tensors='pt')['input_ids']
# run the model on the full sentences and get per-token log probabilities
with torch.no_grad():
    batched_logprobs = F.log_softmax(model(batched_sentences.cuda())['logits'], dim=-1).cpu()
# logits at position i predict token i + 1, so the log probabilities for the
# answer tokens live at positions len(question) - 1 through len(sentence) - 2
batched_logprobs = batched_logprobs[:, len(batched_questions[0]) - 1 : -1, :]
# score each candidate by summing the log probabilities of its answer tokens (unvectorized for clarity)
scores = []
for sentence, question, logprobs in zip(batched_sentences, batched_questions, batched_logprobs):
    answer = sentence[len(question):]
    # greedy per-token prediction, printed only for inspection
    guess = logprobs.argmax(dim=-1)
    print(tokenizer.decode(guess), bool((guess == answer).all()))
    scores.append(float(torch.gather(logprobs, 1, answer.unsqueeze(-1)).sum()))
# predict the answer
print(test[torch.tensor(scores).argmax()])

This can answer correctly even if the greedy per-token predictions printed above do not exactly match the answer tokens.
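As a side note, HellaSwag results are usually also reported with a length-normalized accuracy (the harness's acc_norm). A minimal sketch of such a variant, assuming normalization by the character length of the answer string (this exact normalization is my assumption, not from this thread):

norm_scores = [s / len(a) for s, (_, a) in zip(scores, test)]
print(test[torch.tensor(norm_scores).argmax()])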
Hello, I was wondering how we could alter this code for few-shot prompting? Thank you so much.
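One way to adapt the snippet above is to prepend the same formatted demonstrations to every question before encoding, leaving the scoring code unchanged. A minimal sketch, where the demonstration examples and their formatting are illustrative assumptions:

# hypothetical k-shot demonstrations; the examples and formatting are made up
fewshot_prefix = (
    'London is the capital of England.\n'
    'Berlin is the capital of Germany.\n'
)
# prepend the demonstrations to each question, then rerun the encoding and
# scoring steps with fewshot_test in place of test
fewshot_test = [(fewshot_prefix + q, a) for q, a in test]

Since the scoring only slices off the question tokens, the demonstrations are treated as part of the context and are never scored themselves.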
Hi there, how do you evaluate Llama on HellaSwag? Llama does not have an API that accepts an echo argument and returns the log probs. To the best of my knowledge, if we use Llama from Hugging Face, we can only get logits as model output. How can we make it work on HellaSwag?