This guide may be of special interest to users who are using the library outside of the repository, via installing the library via pypi and calling lm_eval.evaluator.evaluate()
to evaluate an existing model.
In order to properly evaluate a given LM, we require implementation of a wrapper class subclassing the lm_eval.api.model.LM
class, that defines how the Evaluation Harness should interface with your model. This guide walks through how to write this LM
subclass via adding it to the library!
To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment:
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout -b <model-type>
pip install -e ".[dev]"
Now, we'll create a new file where we'll be adding our model:
touch lm_eval/models/<my_model_filename>.py
Tip: this filename should not shadow package names! For example, naming your file anthropic.py
is disallowed since the API's name on pypi is anthropic
, but naming it anthropic_llms.py
works with no problems.
All models must subclass the lm_eval.api.model.LM
class.
The LM class enforces a common interface via which we can extract responses from a model:
class MyCustomLM(LM):
#...
def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
#...
def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float, bool]]:
#...
def generate_until(self, requests: list[Instance]) -> list[str]:
#...
#...
Where Instance
is a dataclass defined in lm_eval.api.instance
with property args
of request-dependent type signature described below.
We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.
All three request types take as input requests
of type list[Instance]
that have a matching Instance.request_type
to the method name.
-
generate_until
- Each request contains
Instance.args : Tuple[str, dict]
containing 1. an input string to the LM and 2. a dictionary of keyword arguments used to control generation parameters. - Using this input and these generation parameters, text will be sampled from the language model (typically until a maximum output length or specific stopping string sequences--for example,
{"until": ["\n\n", "."], "max_gen_toks": 128}
). - The generated input+output text from the model will then be returned.
- Each request contains
-
loglikelihood
- Each request contains
Instance.args : Tuple[str, str]
containing 1. an input string to the LM and 2. a target string on which the loglikelihood of the LM producing this target, conditioned on the input, will be returned. - Each request will have, as result,
(ll, is_greedy): Tuple[float, int]
returned, wherell
is a floating point number representing the log probability of generating the target string conditioned on the input, andis_greedy
being either the value0
or1
, with it being1
if and only if the target string would be generated by greedy sampling from the LM (that is, if the target string is the most likely N-token string to be output by the LM given the input. )
- Each request contains
-
loglikelihood_rolling
- Each request contains
Instance.args : Tuple[str]
, which is an input string to the model whose entire loglikelihood, conditioned on purely the EOT token, will be calculated. - This is used to evaluate perplexity on a data distribution.
- It should return
(ll,) : Tuple[float]
, a.k.a. solely the loglikelihood of producing each piece of text given no starting input.
- Each request contains
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that loglikelihood_rolling
is a special case of loglikelihood
). For a reference implementation, check out lm_eval/models/huggingface.py
! Additionally, check out lm_eval.api.model.TemplateLM
for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the lm_eval.models.huggingface.HFLM
class and overriding just the initialization or a couple methods!
Tip: be careful of indexing in loglikelihood!
LMs take in tokens in position [0 1 2 ... N]
and output a probability distribution for token position N+1
. We provide a simplified graphic here, excerpted from huggingface.py
:
# how this all works (illustrated on a causal decoder-only setup):
# CTX CONT
# inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
# model \ \
# logits 1 2 3|4 5 6 7 8 9 <- the ctx half gets tossed out by the
# cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
The final token of the target is not passed into the LM, because we want the LM's predictions up to but not past that final target token. For more information, check out #942 .
Congrats on implementing your model! Now it's time to test it out.
To make your model usable via the command line interface to lm-eval
using python -m lm_eval
, you'll need to tell lm-eval
what your model's name is.
This is done via a decorator, lm_eval.api.registry.register_model
. Using register_model()
, one can both tell the package what the model's name(s) to be used are when invoking it with python -m lm_eval --model <name>
and alert lm-eval
to the model's existence.
from lm_eval.api.registry import register_model
@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
Using this decorator results in the class being added to an accounting of the usable LM types maintained internally to the library at lm_eval.api.registry.MODEL_REGISTRY
. See lm_eval.api.registry
for more detail on what sorts of registries and decorators exist in the library!
Tip: be sure to import your model in lm_eval/models/__init__.py!
We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .
Many models are fine-tuned with a Chat Template in order to enable back-and-forth interaction between a "User"'s queries and the model (often called "Assistant")'s responses. It can be desirable to evaluate fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.
In order to make your model optionally compatible with a chat format, three additional methods must be implemented:
class MyCustomLM(LM):
#...
@property
def tokenizer_name(self) -> str:
"""
Return the name of the model's tokenizer and/or the accompanying chat template.
The returned string is used to cache requests.
Returns:
str: The name of the model's tokenizer and/or chat template.
"""
def chat_template(self, chat_template: Union[bool, str] = False) -> str:
"""
Get the appropriate chat template for the model based on the `chat_template` argument.
This method returns the chat template string to build the prompt from a chat history.
The chat template is saved in the evaluation results for reproducibility.
Boolean arguments should be used with models that have only one chat template,
while string arguments are used with models that have multiple chat templates.
For the reference implementation, see HFLM class in `lm_eval.models.huggingface`.
Args:
chat_template (Union[bool, str]): Specifies whether to apply a chat template:
- If False: Do not apply any chat template.
- If True: Apply the default chat template.
- If str: Apply the specified chat template by name.
Returns:
str: The selected chat template in Jinja format.
"""
def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
"""
Process a chat history to create a string that can be tokenized and input into the model.
Args:
chat_history (List[Dict[str, str]]): A list of dictionaries representing the chat history,
where each dictionary has "role" and "content" keys.
Returns:
str: A string representing the chat history that can be tokenized and fed into the model.
"""
apply_chat_template
- This method performs the bulk of the work required for chat-formatting.
- As input, a
chat_history: List[Dict[str, str]]
is passed in. This is a transcript of a conversation of a form similar towhich can then be converted into a string input.[ {"system": <user-provided system message such as "You are a helpful math-focused chatbot">}, {"user": <task example - a few-shot example 'input'>} {"assistant": <correct response to the above example>}, # ... more few-shot examples, potentially {"user": <test set query--response on which we will evaluate>}, ]
- The output is a string representing this conversation that can be fed into the model.
- For example, this consists of simply calling
tokenizer.apply_chat_template
for HFLM--see the implementation there for reference.
tokenizer_name
- LM Eval Harness supports caching requests that are sent to a model, for faster setup when repeating an already-performed evaluation.
- However, we don't want to use the cache of chat transcripts rendered using one chat template or system prompt to send to a model with a different template! So, we use this
lm.tokenizer_name
string to distinguish caches for a given model (and chat template) from one another.
chat_template
- Chat templates are typically provided as a Jinja template string or a string formatted with str.format to include user and assistant messages in a single prompt. This template string is saved in the evaluation results to ensure reproducibility.
If not implemented for a given model type, the flags --apply_chat_template
, --fewshot_as_multiturn
, and --system_instruction
cannot be used.
Pro tip: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate it, HuggingFace models come in-built with the ability to provide responses on data points in descending order by total input length via lm_eval.utils.Reorderer
. Take a look at lm_eval.models.hf_causal.HFLM
to see how this is done, and see if you can implement it in your own model!
After reading this guide, you should be able to add new model APIs or implementations to the Eval Harness library!