This repository contains code for our VeriScore factuality metric, which is a pipelined approach with three steps: (1) claim extraction, (2) evidence retrieval, and (3) claim verification. This package can be run either with closed-source LLMs (requires an OpenAI/Anthropic API key) or with fine-tuned models for claim extraction and verification that you can download and use. The evidence retrieval step is performed with Google Search via the Serper API, so you will need to get a Serper API key here to use VeriScore.
Please see our Colab notebook for a demo!
You can choose between prompting OpenAI/Anthropic models or using our fine-tuned models via the `model_name` option. If you specify the path to the checkpoint of a fine-tuned model, it will automatically load the local model for inference; otherwise, it will use an API call instead.
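For illustration, the dispatch convention can be sketched as follows. This is a minimal sketch of the idea, not the package's actual logic; the helper name and the path-existence check are assumptions:

import os

def resolve_model(model_name: str) -> str:
    # If model_name points to an existing local checkpoint, treat it as a
    # fine-tuned model and run inference locally (e.g., via Unsloth).
    if os.path.exists(model_name):
        return "local"
    # Otherwise treat it as an API model identifier (e.g., "gpt-4o") and
    # route the request to the OpenAI/Anthropic API.
    return "api"

print(resolve_model("gpt-4o"))               # api
print(resolve_model("./ckpt/extractor_ft"))  # local, if that path exists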
- Make a new Python 3.9+ environment using `virtualenv` or `conda`.
- Install the `veriscore` package using `pip`.
- Download `en_core_web_sm` using the `spacy` library.
- Our code supports inference with fine-tuned models through the Unsloth library. To use this feature, you need to install Unsloth. If you use `conda`, make sure to follow the conda-specific instructions from Unsloth.
pip install --upgrade veriscore
python -m spacy download en_core_web_sm
- Download the `prompt` folder that contains the txt files of the prompt templates (see the `prompt` folder in this repository).
- Add an OpenAI or Claude API key as an environment variable in your `bash` for the prompting approach:
export OPENAI_API_KEY_PERSONAL={your_openai_api_key}
export CLAUDE_API_KEY={your_claude_api_key}
- Set the Serper API key as an environment variable in your `bash` for evidence retrieval:
export SERPER_KEY_PRIVATE={your_serper_api_key}
- For the prompt-based approach, you need to populate `data_dir/demos/` with few-shot examples.
The following command runs the end-to-end VeriScore pipeline:
python3 -m veriscore.veriscore --data_dir {data_dir} --input_file {input_file} --model_name_extraction {model_name_extraction} --model_name_verification {model_name_verification}
- `data_dir`: Directory containing the input data. `./data` by default.
- `input_file`: Name of the input data file. It should be in `jsonl` format, where each line contains (see the example record after this list):
  - `question` (optional): A query to prompt a language model for an output. If the `question` key is provided, the model automatically adapts to QA prompting; if not, it extracts claims using non-QA prompting.
  - `response`: An output generated by the language model given the `question`.
  - `model`: Name of the model that generated the `response`.
  - `prompt_source`: Name of the dataset the `question` comes from (e.g., FreshQA).
- `model_name_extraction`: Name of the model used for claim extraction; `gpt-4-0125-preview` by default.
- `model_name_verification`: Name of the model used for claim verification; `gpt-4o` by default.
- `use_external_extraction_model`: If specified, uses your custom extraction model instead of an API call. We use Unsloth for the fine-tuned model. False by default.
- `use_external_verification_model`: If specified, uses your custom verification model instead of an API call. We use Unsloth for the fine-tuned model. False by default.
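For reference, here is a minimal sketch of building such an input file with Python's standard library; the file name and record values are illustrative, not required by the package:

import json
import os

# One JSON record per line, with the fields listed above;
# "question" is optional and switches claim extraction to QA prompting.
record = {
    "question": "Who wrote The Brothers Karamazov?",
    "response": "The Brothers Karamazov was written by Fyodor Dostoevsky.",
    "model": "gpt-4o",
    "prompt_source": "FreshQA",
}

os.makedirs("./data", exist_ok=True)
with open("./data/my_input.jsonl", "w") as f:  # hypothetical file name
    f.write(json.dumps(record) + "\n")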
Other optional flags:
- `output_dir`: Directory for saving output data. `./data` by default.
- `cache_dir`: Directory for saving cache data. `./data/cache` by default.
- `label_n`: Label scheme for claim verification. It can be `2` (binary) or `3` (ternary).
  - `2`: `supported` and `unsupported`
  - `3`: `supported`, `contradicted`, and `inconclusive`
- `search_res_num`: Hyperparameter for the number of search results. `10` by default.
Saving output:
- `input_file_name` is the input file name from `--input_file` with the `.jsonl` extension removed.
- Extracted claims will be saved to `output_dir/claims_{input_file_name}.jsonl`.
- Searched evidence will be saved to `output_dir/evidence_{input_file_name}.jsonl`.
- Verified claims will be saved to `output_dir/model_output/verification_{input_file_name}.jsonl` (see the sketch after this list for inspecting this file).
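Once the pipeline has finished, the saved files can be inspected directly. Below is a minimal sketch that tallies verification labels per line of the verification output; the path is illustrative (it assumes the default `output_dir` and an input file named `my_input.jsonl`), and this simple tally is not the paper's full VeriScore computation:

import json
from collections import Counter

# Hypothetical path following the naming scheme above.
path = "./data/model_output/verification_my_input.jsonl"

with open(path) as f:
    for line in f:
        record = json.loads(line)
        labels = Counter(
            item["verification_result"]
            for item in record.get("claim_verification_result", [])
        )
        print(record.get("model"), dict(labels))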
`Claim extraction`:
python3 -m veriscore.extract_claims --data_dir {data_dir} --input_file {input_file} --model_name {model_name}
- `input_file`: Name of the input data file. It should be in `jsonl` format, where each line contains:
  - `question` (optional): A query to prompt a language model for an output. If the `question` key is provided, the model automatically adapts to QA prompting; if not, it extracts claims using non-QA prompting.
  - `response`: An output generated by the language model given the `question`.
  - `model`: Name of the model that generated the `response`.
  - `prompt_source`: Name of the dataset the `question` comes from (e.g., FreshQA).
- `model_name`: Name of the model used for claim extraction. `gpt-4-0125-preview` by default.
- `use_external_model`: If specified, uses your custom model instead of an API call. We use Unsloth for the fine-tuned models. False by default.
output:
{
"question": question.strip(),
"prompt_source": prompt_source,
"response": response.strip(),
"prompt_tok_cnt": prompt_tok_cnt,
"response_tok_cnt": response_tok_cnt,
"model": model,
"abstained": False,
"claim_list": list of claims for each snippet,
"all_claims": list of all claims
}
`Evidence searching`:
python3 -m veriscore.retrieve_evidence --data_dir {data_dir} --input_file {input_file}
- `input_file`: Name of the input data file. It should be in `jsonl` format, where each line contains the keys of the output dictionary from the `Claim extraction` step.

output:
{
...
"claim_snippets_dict": dictionary for claim and list of searched evidence. each evidence have dictionary of {"title": title, "snippet": snippet, "link": link}
}
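For example, the retrieved evidence can be iterated over as follows; this is a minimal sketch, and the file path is illustrative (default `output_dir` with an input file named `my_input.jsonl`):

import json

with open("./data/evidence_my_input.jsonl") as f:  # hypothetical path
    for line in f:
        record = json.loads(line)
        for claim, snippets in record["claim_snippets_dict"].items():
            for ev in snippets:
                # Each evidence item has "title", "snippet", and "link" keys.
                print(claim, "->", ev["title"], ev["link"])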
`Claim verification`:
python3 -m veriscore.verify_claims --data_dir {data_dir} --input_file {input_file} --model_name {model_name}
- `input_file`: Name of the input data file. It should be in `jsonl` format, where each line contains the keys of the output dictionary from the `Evidence searching` step.
- `use_external_model`: If specified, uses your custom model instead of an API call. We use Unsloth to run inference with the specified model. False by default.
output:
{
    ...
    "claim_verification_result": [
        {
            "claim": claim text,
            "search_results": concatenated search result,
            "verification_result": verification label
        },
        ...
    ]
}
This Google Drive folder contains:
- the long-form generations of the tested models given the prompts in Table 1 in the paper
- human annotation results in Section 3.3