Improve argilla integration (#119)
* Set `0.1.0` version

* Delete unused `examples/label-dataset-using-judgelm.py`

* Upgrade `argilla` to 1.18.0

The version cannot be lower, as `metadata_properties` were only introduced in Argilla v1.18.0

* Improve `argilla` integration

Now the methods to implement on each `Task` are `to_argilla_dataset` and `to_argilla_record`, making it easier and more straightforward for users who want to integrate Argilla within their tasks (see the illustrative sketch after this commit message)

* Update `README.md`

* Update `README.md`

* Fixed some typos thanks to @codespell-project

* Rename `responses_column` to `generations_column`

* Clean `_merge_rationales` to re-use `generations_column` too

* Update `README.md`

* Update `README.md`

* Update `README.md`

* Rename `responses_values` -> `ratings_values`

* Apply suggestions from code review

Co-authored-by: Gabriel Martin <gabriel@argilla.io>

---------

Co-authored-by: Gabriel Martin <gabriel@argilla.io>
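
As a rough illustration of the new contract, here is a minimal sketch of a custom `Task` wiring up the two hooks. Only the method names `to_argilla_dataset` and `to_argilla_record` come from this commit; the base class, signatures, and the concrete Argilla fields/questions below are assumptions for illustration, not the actual implementation:

```python
# Hypothetical sketch of the new integration points on a custom task.
# Only the two method names come from this commit; the base class, argument
# names, and field/question choices are assumptions for illustration.
import argilla as rg

from distilabel.tasks import TextGenerationTask


class MyPreferenceTask(TextGenerationTask):
    def to_argilla_dataset(self, dataset_row: dict) -> rg.FeedbackDataset:
        # Describe how labelled rows should be rendered in Argilla.
        return rg.FeedbackDataset(
            fields=[
                rg.TextField(name="input"),
                rg.TextField(name="generation"),
            ],
            questions=[
                rg.RatingQuestion(name="rating", values=[1, 2, 3, 4, 5]),
            ],
        )

    def to_argilla_record(self, dataset_row: dict) -> rg.FeedbackRecord:
        # Map a single dataset row to a record in that dataset.
        return rg.FeedbackRecord(
            fields={
                "input": dataset_row["input"],
                "generation": dataset_row["generations"][0],
            }
        )
```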
alvarobartt and gabrielmbmb authored Nov 29, 2023
1 parent 9cbd29e commit 7154396
Showing 11 changed files with 281 additions and 415 deletions.
186 changes: 61 additions & 125 deletions README.md
@@ -1,34 +1,54 @@
<div align="center">
<h1>⚗️ distilabel</h1>
<p>
<em>AI Feedback framework for building datasets and labelers with LLMs</em>
</p>
</div>
<div align="center">
<h1>⚗️ distilabel</h1>
<p><em>AI Feedback (AIF) framework for building datasets and labellers with LLMs</em></p>
</div>

## What's distilabel
distilabel is a framework for AI engineers to align LLM using RLHF-related methods (e.g., reward models, DPO).
![overview](https://github.com/argilla-io/distilabel/assets/36760800/360110da-809d-4e24-a29b-1a1a8bc4f9b7)

> [!TIP]
> To discuss, get support, or give feedback [join Argilla's Slack Community](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g) and you will be able to engage with our amazing community and also with the core developers of `argilla` and `distilabel`.
## What's `distilabel`?

`distilabel` is a framework for AI engineers to align LLMs using RLHF-related methods (e.g. reward models, DPO).

The initial focus is LLM fine-tuning and adaptation but we'll be extending it for predictive NLP use cases soon.

Main use cases are:

1. As an AI engineer I want to **build domain-specific instruction datasets** to fine-tune OSS LLMs with increased accuracy.
2. As an AI engineer I want to **build domain-specific and diverse preference datasets** to use RLHF-related methods and align LLMs (e.g, increase the ability to follow instructions or give thruthful responses).
2. As an AI engineer I want to **build domain-specific and diverse preference datasets** to use RLHF-related methods and align LLMs (e.g, increase the ability to follow instructions or give truthful responses).

This readme might be outdated the best place to get started is the [documentation](http://distilabel.argilla.io/).
> [!WARNING]
> `distilabel` is currently under active development and we're iterating quickly, so take into account that we may introduce breaking changes in the releases during the upcoming weeks. Also, the `README` might be outdated; the best place to get started is the [documentation](http://distilabel.argilla.io/).
> [!TIP]
> To discuss, get support, give feedback [join Argilla's Slack Community](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g)
## Motivation

> [!TIP]
> To contribute check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).
🔥 Recent projects like [Zephyr](https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66) and [Tulu](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101) have shown it's possible to **build powerful open-source models with DPO and AI Feedback** (AIF) datasets.

👩‍🔬 There's a lot of exciting research in the AIF space, such as [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) (the dataset leveraged by Zephyr and Tulu), [JudgeLM](https://github.com/baaivision/JudgeLM), or [Prometheus](https://huggingface.co/kaist-ai/prometheus-13b-v1.0).

🚀 However, going beyond research efforts and applying AIF at scale is different. For enterprise and production use, we need a framework that implements **key AIF methods in a robust, efficient and scalable way**. This framework should enable AI engineers to build custom datasets at scale for their own use cases.

👩‍🎓 This, combined with humans-in-the-loop to improve dataset quality, is the next big leap for OSS LLMs.

⚗️ `distilabel` aims to bridge this gap.

## Key features

* 🤖 **Leverage OSS models and APIs**: 🤗 transformers, OpenAI, 🤗 Inference Endpoints, vLLM, llama.cpp, and more to come.

* 💻 **Scalable and extensible**: Scalable implementations of existing methods (e.g. UltraFeedback). Easily extensible to build and configure your own labellers.

* 🧑‍🦱 **Human-in-the-loop**: One line of code integration with Argilla to improve and correct datasets.

## Quickstart

### Installation

Install with `pip` (requires Python 3.8+):
```sh

```bash
pip install distilabel[openai,argilla]
```

@@ -41,153 +61,69 @@ After installing, you can immediately start experimenting with `distilabel`:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rO1-OlLFPBC0KPuXQOeMpZOeajiwNoMy?usp=sharing)

### Example: build a preference dataset for DPO/RLHF
### Example: Build a preference dataset for DPO/RLHF

```python
from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import pipeline
from distilabel.tasks import TextGenerationTask

# dataset with instructions
# Load a dataset with instructions from the Hub
dataset = (
load_dataset("HuggingFaceH4/instruction-dataset", split="test[:5]")
.remove_columns(["completion", "meta"])
.rename_column("prompt", "input")
)

# use gpt3.5 turbo for generating responses
task = TextGenerationTask()

# Use `OpenAILLM` (running `gpt-3.5-turbo`) to generate responses for given inputs
generator = OpenAILLM(
task=task,
max_new_tokens=512
#openai_api_key="sk-.."
task=TextGenerationTask(),
max_new_tokens=512,
# openai_api_key="sk-...",
)

# build preference dataset comparing two responses
# focusing on the instruction-following skill
pipe = pipeline("preference", "instruction-following", generator=generator)
pipeline = pipeline("preference", "instruction-following", generator=generator)

dataset = pipe.generate(dataset, num_generations=2)
# Build a preference dataset comparing two responses focused on the instruction-following skill of the LLM
dataset = pipeline.generate(dataset)
```

The resulting dataset can already be used for preference tuning (a larger version of it). But beware these AIF dataset are imperfect. To get the most out of AIF feedback, push to Argilla for human feedback:
The resulting dataset can already be used for preference tuning (a larger version of it). But beware: these AIF datasets are imperfect. To get the most out of AIF, push to Argilla for human feedback:

```python
import argilla as rg

rg.init(
api_key="<YOUR_API_KEY>",
api_key="<YOUR_ARGILLA_API_KEY>",
api_url="<YOUR_ARGILLA_API_URL>"
)

rg_dataset = dataset.to_argilla()
rg_dataset.push_to_argilla(name="preference-dataset", workspace="admin")
```



https://github.com/argilla-io/distilabel/assets/1107111/be34c95c-8be4-46ef-9437-cbd2a7687e30

### More examples


## Motivation
🔥 Recent projects like [Zephyr](https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66) and [Tulu](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101) have shown it's possible to **build powerful open-source models with DPO and AI Feedback** (AIF) datasets.

👩‍🔬 There's a lot of exciting research in the AIF space, such as [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) (the dataset leveraged by Zephyr and Tulu), [JudgeLM](https://github.com/baaivision/JudgeLM), or [Prometheus](https://huggingface.co/kaist-ai/prometheus-13b-v1.0).

🚀 However, going beyond research efforts and applying AIF at scale it's different. For enterprise and production use, we need framework that implements **key AIF methods on a robust, efficient and scalable way**. This framework should enable AI engineers to build custom datasets at scale for their own use cases.

👩‍🎓 This, combined with humans-in-the-loop for improving dataset quality is the next big leap for OSS LLM models.

⚗️ `distilabel` aims to bridge this gap.

## Key features

* 🤖 **Leverage OSS models and APIs**: HF Transformers, OpenAI, HF Inference Endpoints, vLLM, LlamaCPP, and more to come.

* 💻 **Scalable and extensible**: Scalable implementations of existing methods (e.g., UltraFeedback). Easily extensible to build and configure your own labelers.

* 🧑‍🦱 **Human-in-the-loop**: One line of code integration with Argilla to improve and correct datasets.

## Overview
![distilabel_overview](https://github.com/argilla-io/distilabel/assets/1107111/182c871c-108f-441e-bb3e-f01b080f8631)

Find more examples of different use cases of `distilabel` under [`examples/`](./examples/).

## Roadmap

- Add Critique Models and support for Prometheus OSS
- Add a generator with multiple models
- Train OSS labelers to replace OpenAI labelers
- Add labelers to evolve instructions generated with self-instruct
- Add labelers for predictive NLP tasks: text classification, information extraction
- Open an issue to suggest a feature!
- [ ] Add Critique Models and support for Prometheus OSS
- [ ] Add a generator with multiple models
- [ ] Train OSS labellers to replace OpenAI labellers
- [ ] Add labellers to evolve instructions generated with self-instruct
- [ ] Add labellers for predictive NLP tasks: text classification, information extraction, etc.
- [ ] Open an issue to suggest a feature!

## How to generate instructions
If you don't have an instruction or prompts dataset you can generate one with our `self-instruct` inspired generator:
## Contribute

```python
import os
from distilabel.tasks import SelfInstructTask
from distilabel.pipeline import Pipeline
from distilabel.llm import OpenAILLM
from datasets import Dataset

math_topics = [
"Algebraic Expressions",
"Linear Equations",
"Quadratic Equations",
"Polynomial Functions",
"Rational Expressions",
"Exponential Functions",
"Logarithmic Functions",
"Sequences and Series",
"Matrices",
"Determinants",
#...
]

dataset = Dataset.from_dict({
"input": math_topics
})

# it will steer the generator
# to generate instructions for this specific app
instruction_task = SelfInstructTask(
application_description= """
An AI assistant adept at answering a wide array of math, logic, and reasoning puzzles, trivia, and general questions.
""",
num_instructions=10 # 10 instructions per input
)

# default model is: gpt3.5-turbo
# you can choose gpt-4 too
instruction_generator = OpenAILLM(
task=instruction_task,
openai_api_key=os.getenv("OPENAI_API_KEY"),
num_threads=8,
max_new_tokens=1024
)
To contribute directly to `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).

pipeline = Pipeline(
generator=instruction_generator
)
## References

# will generate
distiset = pipeline.generate(
dataset=dataset,
# 10 instruction * 10 generations * 10 inputs = 1000 instructions
num_generations=10,
batch_size=4
)
# Output:
# Number of generated instructions: 2044
# 1. Provide an explanation for solving a quadratic equation step by step.
# 2. What is the process for simplifying an algebraic expression with exponents?
# 3. Detail how to factorize a polynomial equation.
# ...
# 10. How can one determine if a given graph represents a linear or quadratic equation?
# 1. How can I simplify the algebraic expression (x^2 + 3x + 2)(2x - 1)?
# 2. Provide step-by-step instructions on how to solve the equation 4(x + 2) - 3 = 7(2x - 1).
# ...
```
* [UltraFeedback: Boosting Language Models with High-quality Feedback](https://arxiv.org/abs/2310.01377)
* [JudgeLM: Fine-tuned Large Language Models are Scalable Judges](https://arxiv.org/abs/2310.17631)
* [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)
50 changes: 0 additions & 50 deletions examples/label-dataset-using-judgelm.py

This file was deleted.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -35,7 +35,7 @@ hf-inference-endpoints = ["huggingface_hub >= 1.19.0"]
llama-cpp = ["llama-cpp >= 0.2.0"]
openai = ["openai >= 1.0.0"]
vllm = ["vllm >= 0.2.1"]
argilla = ["argilla >= 1.16.0"]
argilla = ["argilla >= 1.18.0"]
tests = ["pytest >= 7.4.0"]
docs = [
"mkdocs-material >= 9.4.10",
2 changes: 1 addition & 1 deletion src/distilabel/__init__.py
@@ -12,4 +12,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "0.1.0rc2"
__version__ = "0.1.0"
17 changes: 7 additions & 10 deletions src/distilabel/dataset.py
@@ -18,9 +18,6 @@

from distilabel.utils.imports import _ARGILLA_AVAILABLE

if _ARGILLA_AVAILABLE:
import argilla as rg

if TYPE_CHECKING:
from argilla import FeedbackDataset

@@ -57,13 +54,13 @@ def to_argilla(self) -> "FeedbackDataset":
"The task is not set. Please set it with `dataset.task = <task>`."
)

rg_dataset = rg.FeedbackDataset(
fields=self.task.to_argilla_fields(dataset_row=self[0]),
questions=self.task.to_argilla_questions(dataset_row=self[0]),
metadata_properties=self.task.to_argilla_metadata_properties(
dataset_row=self[0]
),
)
try:
rg_dataset = self.task.to_argilla_dataset(dataset_row=self[0]) # type: ignore
except Exception as e:
raise ValueError(
f"Error while converting the dataset to an Argilla `FeedbackDataset` instance: {e}"
) from e

for dataset_row in self:
if any(
dataset_row[input_arg_name] is None # type: ignore
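
For reference, the consumer-facing flow after this refactor is unchanged from the README example above; a quick sketch with placeholder credentials and names:

```python
# Sketch of the consumer-side flow after the refactor: the task attached to the
# dataset now builds the Argilla `FeedbackDataset` itself via `to_argilla_dataset`,
# so converting and pushing stays a couple of lines. Credentials, dataset name,
# and workspace are placeholders; `dataset` is the labelled dataset returned by
# `pipeline.generate(...)`.
import argilla as rg

rg.init(
    api_key="<YOUR_ARGILLA_API_KEY>",
    api_url="<YOUR_ARGILLA_API_URL>",
)

rg_dataset = dataset.to_argilla()  # raises a ValueError if the task cannot build it
rg_dataset.push_to_argilla(name="preference-dataset", workspace="admin")
```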
4 changes: 2 additions & 2 deletions src/distilabel/pipeline.py
@@ -392,7 +392,7 @@ def _build_dataset( # noqa: C901
processed_labels.extend(future.result())
except Exception as e:
logger.error(
f"An error ocurred when getting the result from the labeller: {e}"
f"An error occurred when getting the result from the labeller: {e}"
)
processed_labels.append(
[
@@ -498,7 +498,7 @@ def generate( # noqa: C901
warnings.warn(
f"Provided `num_generations={num_generations}` which implies that the "
"`generator` LLM will just run once, while the `labelling` LLM expects "
"to recieve a list of N inputs to label, where N is > 1. If this is not "
"to receive a list of N inputs to label, where N is > 1. If this is not "
"intended, make sure to set `num_generations` to a value higher or "
"equal to 2.",
UserWarning,
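
The `num_generations` warning in the hunk above is avoided by asking the generator for at least two candidates per input, as in the README quickstart; a minimal sketch (assumes an OpenAI API key is set in the environment):

```python
# Sketch: request two or more generations per input so the labelling LLM gets a
# list of candidates to compare, avoiding the warning above. Mirrors the README
# quickstart; assumes OPENAI_API_KEY is set in the environment.
from datasets import load_dataset

from distilabel.llm import OpenAILLM
from distilabel.pipeline import pipeline
from distilabel.tasks import TextGenerationTask

dataset = (
    load_dataset("HuggingFaceH4/instruction-dataset", split="test[:5]")
    .remove_columns(["completion", "meta"])
    .rename_column("prompt", "input")
)

generator = OpenAILLM(task=TextGenerationTask(), max_new_tokens=512)

pipe = pipeline("preference", "instruction-following", generator=generator)

# `num_generations=2` -> the labeller receives a list of two responses per input.
dataset = pipe.generate(dataset, num_generations=2)
```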