
Prefix LM Eval #313

Open · wants to merge 3 commits into base: bseval_harness
Conversation

@Muennighoff (Collaborator) commented Jul 16, 2022

This PR adapts evaluation to work with Prefix LMs, such as those used for the T0 finetuning experiments.

Using the normal eval harness I get the following results:

Using CHECKPOINT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr11f-6B3-ml/checkpoints/main/global_step163750 (CKPT prior to MTF):
copa "acc": 0.58

Using CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step2000:
copa "acc": 0.7

Using CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100:
copa "acc": 0.67

Using CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100 without --prefix:
copa "acc": 0.73
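For reviewers, here is a minimal sketch of the attention-mask difference the --prefix option is meant to exercise. This is my own illustration, not code from this PR; the helper names and the boolean mask convention (True = position may be attended to) are assumptions, not the repo's actual implementation.

```python
# Sketch only: how a prefix-LM attention mask differs from a causal one.
# Tokens inside the prefix attend to each other bidirectionally;
# tokens after the prefix remain strictly causal.
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """True where attention is allowed: lower triangle incl. diagonal."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Causal mask, except the first `prefix_len` tokens fully see each other."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # bidirectional over the prefix
    return mask

# Example: 6 tokens total, prefix of 3 (e.g. the prompt/premise),
# continuation of 3 (the candidate answer being scored causally).
print(prefix_lm_mask(6, 3).astype(int))
```

The point is just that the prompt tokens attend to each other bidirectionally while the scored continuation stays causal; without --prefix the whole sequence is scored with the plain causal mask.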

@Muennighoff (Collaborator, Author) commented:

cc @lintangsutawika @haileyschoelkopf - can't add you as reviewers somehow, but would be great if you could take a look.
I'm not 100% sure about the results I got 🧐

@lintangsutawika (Contributor) commented:

Will take a closer look.

@lintangsutawika (Contributor) commented:

@Muennighoff so the intended result is supposed to be that with a Prefix LM the performance should be higher, right? However, based on the scores you shared, this does not seem to be the case.

@Muennighoff (Collaborator, Author) commented:

Yeah, so according to the current results, evaluating the model as a causal LM is better than evaluating it as a prefix LM, even after it was fine-tuned as a prefix LM. Also note:

  • In both cases it is better than prior to fine-tuning.
  • There is no strong performance difference between the CD + CD and CD + ND models in the What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? paper, i.e. between
CD:FLM (219B) + CD:MTF (13B)
CD:FLM (219B) + ND:MTF (13B)
