
flow-judge


Technical Report | Model Weights | Evaluation Code | Examples

flow-judge is a lightweight library for evaluating LLM applications with Flow-Judge-v0.1.

Model

Flow-Judge-v0.1 is an open, small yet powerful language model evaluator trained by Flow AI on a synthetic dataset of LLM system evaluation data.

You can learn more about the unique features of our model in the technical report.

Features of the library

  • Support for multiple model types: Hugging Face Transformers and vLLM
  • Extensible architecture for custom metrics
  • Pre-defined evaluation metrics
  • Ease of custom metric and rubric creation
  • Batched evaluation for efficient processing
  • Integrations with popular frameworks such as Llama Index and Haystack

Installation

Install flow-judge with pip from a local clone of the repository (the -e flag installs it in editable mode):

pip install -e ".[vllm,hf]"
pip install 'flash_attn>=2.6.3' --no-build-isolation

Extras available:

  • dev for development dependencies
  • hf for Hugging Face Transformers support
  • vllm for vLLM support

Quick Start

Here's a simple example to get you started:

from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Create a model using ModelFactory
model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")

# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Sample to evaluate
query = """..."""
context = """..."""
response = """..."""

# Create an EvalInput
# We want to evaluate the response to the customer issue based on the context and the user instructions
eval_input = EvalInput(
    inputs=[
        {"query": query},
        {"context": context},
    ],
    output={"response": response},
)

# Run the evaluation
result = faithfulness_judge.evaluate(eval_input, save_results=False)

# Display the result
display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))

Usage

Supported Model Types

  • Hugging Face Transformers (hf_transformers)
  • vLLM (vllm)

Evaluation Metrics

Flow-Judge-v0.1 was trained to handle any custom metric that can be expressed as a combination of evaluation criteria, a scoring rubric, and the required inputs and outputs.

Pre-defined Metrics

For convenience, the flow-judge library comes with pre-defined metrics such as RESPONSE_CORRECTNESS or RESPONSE_FAITHFULNESS. You can check the full list by running:

from flow_judge.metrics import list_all_metrics

list_all_metrics()
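
Before picking one, you can also inspect what a pre-defined metric checks. This is a minimal sketch that assumes the pre-defined metric objects expose the same name, criteria, and rubric fields as the CustomMetric class shown later in this README; adjust the attribute names if they differ:

from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# Assumption: pre-defined metrics carry the same fields as CustomMetric
print(RESPONSE_FAITHFULNESS_5POINT.name)
print(RESPONSE_FAITHFULNESS_5POINT.criteria)
for item in RESPONSE_FAITHFULNESS_5POINT.rubric:
    print(f"{item.score}: {item.description}")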

Batched Evaluations

For efficient processing of multiple inputs, you can use the batch_evaluate method:

# Read the sample data
import json
from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Create a model using ModelFactory
model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")

# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Load data
with open("sample_data/csr_assistant.json", "r") as f:
    data = json.load(f)

# Create a list of inputs and outputs
inputs_batch = [
    [
        {"query": sample["query"]},
        {"context": sample["context"]},
    ]
    for sample in data
]
outputs_batch = [{"response": sample["response"]} for sample in data]

# Create a list of EvalInput
eval_inputs_batch = [EvalInput(inputs=inputs, output=output) for inputs, output in zip(inputs_batch, outputs_batch)]

# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)

# Visualizing the results
for i, result in enumerate(results):
    display(Markdown(f"__Sample {i+1}:__"))
    display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
    display(Markdown("---"))
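
Because each result carries a numeric score, you can also summarize a batch run directly. A small sketch, assuming the 5-point rubric yields integer scores:

from collections import Counter

# Summarize the batch run (sketch; result.score is assumed to be numeric)
scores = [result.score for result in results]
print(f"Average score over {len(scores)} samples: {sum(scores) / len(scores):.2f}")
print(f"Score distribution: {dict(Counter(scores))}")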

Advanced Usage

Model configurations

Warning

There is a reported issue with Phi-3 models that produces gibberish outputs for contexts longer than 4096 tokens, input and output included. This issue has recently been fixed in the transformers library, so we recommend using the Flow-Judge-v0.1_HF model configuration for longer contexts for now. For more details, refer to: #33129 and #6135

We currently support the vLLM engine (recommended) and Hugging Face Transformers.

We are working on adding API-based usage, as well as better options for running on CPU.
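
For example, to follow the recommendation above for longer contexts, you can create the model from the Hugging Face Transformers configuration instead of the AWQ (vLLM) one. This is a minimal sketch; the configuration names are taken from this README, so check them against the library if they differ:

from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# vLLM-backed configuration (recommended, used in the Quick Start)
vllm_model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")

# Hugging Face Transformers configuration, recommended above for contexts
# longer than 4096 tokens until the Phi-3 fix lands in your environment
hf_model = ModelFactory.create_model("Flow-Judge-v0.1_HF")

judge = FlowJudge(metric=RESPONSE_FAITHFULNESS_5POINT, model=hf_model)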

Custom Metrics

Create your own evaluation metrics:

from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import FlowJudge
from flow_judge.metrics import CustomMetric, RubricItem

custom_metric = CustomMetric(
    name="My Custom Metric",
    criteria="Evaluate based on X, Y, and Z.",
    rubric=[
        RubricItem(score=0, description="Poor performance"),
        RubricItem(score=1, description="Good performance"),
    ],
    required_inputs=["query"],
    required_output="response"
)

model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")
judge = FlowJudge(metric=custom_metric, model=model)
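
A custom metric is used the same way as the pre-defined ones: build an EvalInput whose keys match required_inputs and required_output, then evaluate. A short usage sketch with hypothetical placeholder data:

from flow_judge.flow_judge import EvalInput

# The input keys must match required_inputs (["query"]) and
# required_output ("response") declared on the custom metric.
eval_input = EvalInput(
    inputs=[{"query": "How do I reset my password?"}],
    output={"response": "Click 'Forgot password' on the login page."},
)

result = judge.evaluate(eval_input, save_results=False)
print(f"Score: {result.score}\nFeedback: {result.feedback}")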

Integrations

We support integrations with the Llama Index evaluation module and with Haystack.

We are working on adding integrations with more frameworks in the near future.

Development Setup

  1. Clone the repository:

    git clone https://github.com/flowaicom/flow-judge.git
    cd flow-judge
  2. Create a virtual environment:

    virtualenv ./.venv

    or

    python -m venv ./.venv
  3. Activate the virtual environment:

    • On Windows:
      .venv\Scripts\activate
    • On macOS and Linux:
      source .venv/bin/activate
  4. Install the package in editable mode with development dependencies:

    pip install -e ".[dev]"

    or

    pip install -e ".[dev,vllm]"

    for vLLM support.

  5. Set up pre-commit hooks:

    pre-commit install
  6. Run pre-commit on all files:

    pre-commit run --all-files
  7. You're now ready to start developing! You can run the main script with:

    python -m flow_judge

Remember to always activate your virtual environment when working on the project. To deactivate the virtual environment when you're done, simply run:

deactivate

Running Tests

To run the tests for Flow-Judge, follow these steps:

  1. Navigate to the root directory of the project in your terminal.

  2. Run the tests using pytest:

    pytest tests/

    This will discover and run all the tests in the tests/ directory.

  3. If you want to run a specific test file, you can do so by specifying the file path:

    pytest tests/test_flow_judge.py
  4. For more verbose output, you can use the -v flag:

    pytest -v tests/

Contributing

Contributions to flow-judge are welcome! Please follow these steps:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Please ensure that your code adheres to the project's coding standards and passes all tests.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

Flow-Judge is developed and maintained by the Flow AI team. We appreciate the contributions and feedback from the AI community in making this tool more robust and versatile.
