Releases: argilla-io/distilabel
1.4.2
What's Changed
- Fix chat template not applied in
TransformersLLM
by @gabrielmbmb in #1083
Full Changelog: 1.4.1...1.4.2
1.4.1
What's Changed
- Fix not handling list of all primitive types in
SignatureMixin
by @gabrielmbmb in #1037
Full Changelog: 1.4.0...1.4.1
1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We’ve updated the LLM
interface so now LLM
s using an external platform that offers a batch service can be integrated in distilabel
. In addition, OpenAILLM
has been updated so it can use the OpenAI Batch API to get 50% cost reductions.
distilabel-offline-batch-generation.mp4
Improved cache for maximum outputs reusability
We all know that running LLM
is costly and most of the times we want to reuse as much as we can the outputs generated with them. Before this release, distilabel
cache mechanism enabled to recover a pipeline execution that was stopped before finishing and to re-create the Distiset
generated by one that finished its execution and was re-executed.
In this release, we've greatly improved the cache so the outputs of all the Step
s are cached and therefore can be reused in other pipelines executions even if the pipeline has changed:
In addition, we've added a use_cache
attribute in the Step
s that allows toggling the use of the cache at step level.
Steps can generated artifacts
In some cases, Step
produces some additional artifacts that are used to generate its outputs. These artifacts can take some time to be generated and they could be reused in the future. That’s why we’ve added a new method called Step.save_artifact
that can be called within the step to store artifacts generated by it. The artifacts generated by the Step
will also get uploaded to the Hugging Face Hub.
from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput, StepOutput
import matplotlib.pyplot as plt
if TYPE_CHECKING:
from distilabel.steps import StepOutput
class CountTextCharacters(GlobalStep):
@property
def inputs(self) -> List[str]:
return ["text"]
@property
def outputs(self) -> List[str]:
return ["text_character_count"]
def process(self, inputs: StepInput) -> "StepOutput": # type: ignore
character_counts = []
for input in inputs:
text_character_count = len(input["text"])
input["text_character_count"] = text_character_count
character_counts.append(text_character_count)
# Generate plot with the distribution of text character counts
plt.figure(figsize=(10, 6))
plt.hist(character_counts, bins=30, edgecolor="black")
plt.title("Distribution of Text Character Counts")
plt.xlabel("Character Count")
plt.ylabel("Frequency")
# Save the plot as an artifact of the step
self.save_artifact(
name="text_character_count_distribution",
write_function=lambda path: plt.savefig(path / "figure.png"),
metadata={"type": "image", "library": "matplotlib"},
)
plt.close()
yield inputs
New Tasks
: CLAIR
, APIGEN
and many more!
- New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting preference A
preferred
A’ is much more contrastive and precise. - New tasks to replicate APIGen framework:
APIGenGenerator
,APIGenSemanticChecker
,APIGenExecutionChecker
. These tasks allow generating datasets like the one presented in the paper: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets - New URIAL task that allows using non-instruct models to generate a response for an instruction.
- New TextClassification task to make zero-shot text classification based on a predefined but highly customizable prompt.
- TextClustering, to generate clusters from text and group your generations, discovering labels from your data. Comes with 2 steps to run UMAP and DBSCAN algorithms.
- Updated TextGeneration to simplify customization of tasks that don’t require further post-processing.
New Steps to sample data in your pipelines and remove duplicates
- New DataSampler step to sample data from other datasets, which can be useful to inject different examples for few-shot examples in your prompts.
- New EmbeddingDedup step to remove duplicates based on embeddings and a distance metric.
- New MinHashDedup step to remove near duplicates from the text based on MinHash and MinHashLSH algorithm.
- New TruncateTextColumns to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New CombineOutputs to combine the outputs of two or more steps into a single output.
Generate text embeddings using vLLM
- Now you can generate embeddings using vLLMEmbeddings!
Extra things
- Easily visualize the tasks’ prompts using Task.print method.
- New use_default_structured_outputs flag in tasks to automatically use structured generation in some tasks that can benefit from it.
What's Changed
- Make
ClientvLLM.model_name
acached_property
by @gabrielmbmb in #862 - Pass dataset to dry_run method by @plaguss in #863
- Add default structured output for
GenerateSentencePair
task by @plaguss in #868 - Complexity scorer default structured output by @plaguss in #870
- Quality scorer default structured output by @plaguss in #873
- Ultrafeedback default structured output by @plaguss in #876
- Remove use of
default_chat_template
by @gabrielmbmb in #888 - Temporary fix for installing
llama-cpp-python
by @gabrielmbmb in #886 - Fix unit tests after release of
transformers==4.44.0
by @gabrielmbmb in #891 - Fix default structured output by @plaguss in #892
- Send as many batches as possible to input queues by @gabrielmbmb in #895
- Exclude
repo_id
fromLoadDataFromFileSystem
by @plaguss in #898 - Fix loader to read from a glob pattern by @plaguss in #877
- Add
save_artifact
method to_Step
by @gabrielmbmb in #871 - Add new
add_raw_input
argument to_Task
so we can automatically include the formatted input by @plaguss in #903 - New
TruncateTextColumn
to truncate the length of texts using the number of tokens or characters by @plaguss in #902 - Update
inputs
andoutputs
interface to allow returning dict indicating optionality by @gabrielmbmb in #883 - Update mistrallm by @plaguss in #904
- Deepseek prover by @plaguss in #907
- Update
RewardModelScore.inputs
property by @gabrielmbmb in #908 - Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in #893
- Fix load data from disk by @plaguss in #910
- docs: minor fixes by @davidberenstein1957 in #913
- Add
URIAL
task by @gabrielmbmb in #921 - Add
vLLMEmbeddings
by @plaguss in #920 - docs: add tutorials preference and clean by @sdiazlor in #917
- Fix
StructuredGeneration
examples and internal check by @plaguss in #912 - Generate deterministic pipeline name when it's not given by @plaguss in #878
- Add custom errors by @plaguss in #911
- Docs/tutorials fix by @sdiazlor in #922
- Add
revision
runtime parameter toLoadDataFromHub
by @gabrielmbmb in #928 - Add plausible as replacement for GA by @davidberenstein1957 in #929
- Add minhash related steps to deduplicate texts by @plaguss in #931
- docs: API reference review by @sdiazlor in #932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in #937
- Update
make_generator_step
to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://g...
1.3.2
What's Changed
- Deepseek prover task by @plaguss in #733
- Do not cancel in progress docs workflows by @gabrielmbmb in #919
- Fix creating Ray placement groups for vLLM by @gabrielmbmb in #918
- Fix passing
base_url
inmodel_id
inInferenceEndpointsLLM
by @gabrielmbmb in #924
Full Changelog: 1.3.1...1.3.2
1.3.1
What's Changed
- Create new
distilabel.constants
module to store constants and avoid circular imports by @plaguss in #861 - Add OpenAI request timeout by @ashim-mahara in #858
New Contributors
- @ashim-mahara made their first contribution in #858
Full Changelog: 1.3.0...1.3.1
1.3.0
What's Changed
- Add new step
CombineKeys
by @plaguss in #747 - Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in #758
- Drop remove deprecated
LoadHubDataset
by @davidberenstein1957 in #759 - Add
requirements
list forPipeline
by @plaguss in #720 - Add
StepResources
and step replicas inPipeline
by @gabrielmbmb in #750 - Add load stages by @gabrielmbmb in #760
- Update min required version to
python==3.9
by @gabrielmbmb in #770 - Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in #762
- Add
docs-pr.yml
anddocs-pr-close.yml
workflows by @gabrielmbmb in #774 - Add
RayPipeline
class by @gabrielmbmb in #769 - Fixed closed PR workflow by @gabrielmbmb in #776
- Add
Magpie
andMagpieGenerator
tasks by @gabrielmbmb in #778 - Fix some issues related to
Magpie
task by @gabrielmbmb in #783 - Add
end_with_user
andinclude_system_prompt
flags toMagpie
tasks and handleNone
s. by @gabrielmbmb in #784 - Add workflow concurrency group for publishing docs by @gabrielmbmb in #796
- Add
_desired_num_gpus
attribute toCudaDevicePlacementMixin
by @gabrielmbmb in #795 - Compatibility with
vLLM
withtensor_parallel_size
argument by @gabrielmbmb in #805 - Update default names in
GroupColumns
by @plaguss in #808 - Request batches to
GeneratorStep
if only step in pipeline by @gabrielmbmb in #828 - Add default name for a pipeline by @plaguss in #809
- Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in #821
- Some more
Magpie
improvements by @gabrielmbmb in #833 - Add
Embeddings
base class,SentenceTransformerEmbeddings
class,EmbeddingGeneration
andFaissNearestNeighbour
steps by @gabrielmbmb in #830 - Create file per hostname in
CudaDevicePlacementMixin
by @gabrielmbmb in #814 - Create a
GeneratorStep
from a dataset using a helper function by @plaguss in #812 - Do not take into account
disable_cuda_device_placement
for pipeline signature by @gabrielmbmb in #838 - Add
RewardModelScore
step by @gabrielmbmb in #840 - Fix
LoadDataFromHub
attribute_dataset
hadellipsis
by default instead ofNone
by @gabrielmbmb in #841 - Create
PlacementGroup
for steps usingvLLM
by @gabrielmbmb in #842 - Update
argilla
integration to useargilla_sdk
v2 by @alvarobartt in #705 - Make
overall-rating
the default aspect forUltraFeedback
task by @gabrielmbmb in #843 - fix typo index.md by @franperic in #844
- Use
CudaDevicePlacementMixin
inRewardModelScore
step by @gabrielmbmb in #845 - Gather GPUs per Ray node to create placement groups by @gabrielmbmb in #848
- Fix typo in docs by @plaguss in #850
- Add
xfail
routing batch function tests by @gabrielmbmb in #852 - Fix creating placement group when
pipeline_parallel_size>1
by @gabrielmbmb in #851 - docs: 846 docs include google analytics by @davidberenstein1957 in #847
- Add
ClientvLLM
class by @gabrielmbmb in #854 - Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in #856
- Add bibtex references in the docstrings to be shown in the README by @plaguss in #855
- distilabel
1.3.0
by @gabrielmbmb in #857
New Contributors
- @franperic made their first contribution in #844
Full Changelog: 1.2.4...1.3.0
1.2.4
What's Changed
- Update
InferenceEndpointsLLM
to usechat_completion
method by @gabrielmbmb in #815
Full Changelog: 1.2.3...1.2.4
1.2.3
What's Changed
- Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in #786
- Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in #791
- docs: update script for issue dashboard by @sdiazlor in #775
- Fix 404 model not found for private Serverless IE by @dvsrepo in #806
New Contributors
- @Hassaan-Qaisar made their first contribution in #786
Full Changelog: 1.2.2...1.2.3
1.2.2
What's Changed
- Fix passing
input
toformat_output
function by @gabrielmbmb in #781
Full Changelog: 1.2.1...1.2.2
1.2.1
What's Changed
- Fix docs for distiset.save_to_disk kwargs by @fpreiss in #745
- docs: change references by @sdiazlor in #754
- Fix
response_format
forTogetherLLM
andAnyScaleLLM
by @gabrielmbmb in #764
New Contributors
Full Changelog: 1.2.0...1.2.1