Open source project for data preparation of LLM application builders


depenglee1707/data-prep-kit

 
 

Data Prep Kit


Data Prep Kit is a community project to democratize and accelerate unstructured data preparation for LLM app developers. With the explosive growth of LLM-enabled use cases, developers are faced with the enormous challenge of preparing use case-specific unstructured data to fine-tune or instruct-tune the LLMs. As the variety of use cases grows, so does the need to support:

  • New modalities of data (code, language, speech, visual)
  • New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case
  • A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Kit offers implementations of commonly needed data preparation steps, called modules or transforms, for both Code and Language modalities. The goal is to offer high-level APIs for developers to quickly get started in working with their data, without needing expertise in the underlying runtimes and frameworks.

📖 About

Data Prep Kit is a toolkit that streamlines data preparation for developers building LLM-enabled applications via fine-tuning or instruction-tuning. It contributes a set of modules that developers can use to quickly build data pipelines suited to their use case. These modules have been tested while producing pre-training datasets for the Granite open models, here and here.

The modules are built on common frameworks (for Spark and Ray), collectively called the data processing library, which allows developers to build new custom modules that readily scale across a variety of runtimes. Eventually, Data Prep Kit will offer consistent APIs and configurations across the following underlying runtimes:

  1. Python runtime
  2. Ray runtime (local and distributed)
  3. Spark runtime (local and distributed)
  4. Kubeflow Pipelines (local and distributed, wrapping Ray)

Features of the toolkit:

Data modalities supported:

  • Code - supports code datasets in the form of downloaded .zip files of GitHub repositories, converted to parquet files.
  • Language - supports natural language datasets and, like the code transformations, operates on parquet files.
  • Universal - supports code and natural language datasets, and can operate on parquet files, zip archives, or individual HTML files.

Support for additional data modalities is expected in the future and additional data formats are welcome!

Data Preparation Modules

The matrix below shows the combinations of modules and supported runtimes. All the modules can be accessed here and can be combined to form data processing pipelines, as shown in the examples folder. The modules fall into three major categories: 1) Universal (apply to both code and language), 2) Language-only, and 3) Code-only. We start with a set of modules for ingestion of various data formats.

Modules Python-only Ray Spark KFP on Ray

Data Ingestion
  • Code (from zip) to Parquet
  • PDF to Parquet
  • HTML to Parquet

Universal (Code & Language)
  • Exact dedup filter
  • Fuzzy dedup filter
  • Unique ID annotation
  • Filter on annotations
  • Profiler
  • Resize
  • Tokenizer
  • No-op / template

Language-only
  • Language identification
  • Document quality
  • Document chunking for RAG
  • Text encoder

Code-only
  • Programming language annotation
  • Code quality annotation
  • Malware annotation
  • Header cleanser
  • Semantic file ordering

Contributors are welcome to add new modules as well as add runtime support for existing modules!

Data Processing Framework

At the core of the framework is a data processing library that provides a systematic way to implement the data processing modules. The library is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. We use the popular parquet format to store the data (code or language). Every parquet file follows a set schema. A user can apply one or more transforms (or modules), as discussed above, to process their data.
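
As a rough illustration of this files-in, files-out model, the sketch below applies one transform function to several input tables and checks each one against a fixed schema. It uses only the Python standard library: the in-memory lists of dicts stand in for parquet files, and the names `REQUIRED_COLUMNS`, `check_schema`, `run_transform`, and `lowercase_contents` are hypothetical, not part of the data processing library's API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]  # stand-in for the contents of one parquet file

# Assumed schema, for illustration only; the real schema is defined by the toolkit.
REQUIRED_COLUMNS = {"document_id", "contents"}


def check_schema(name: str, table: Table) -> None:
    """Raise ValueError if any row is missing a required column."""
    for row in table:
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"{name}: missing columns {sorted(missing)}")


def run_transform(
    transform: Callable[[Table], Table], inputs: Dict[str, Table]
) -> Dict[str, Table]:
    """Apply one transform to every input file and collect outputs by file name."""
    outputs: Dict[str, Table] = {}
    for name, table in inputs.items():
        check_schema(name, table)
        outputs[name] = transform(table)
    return outputs


def lowercase_contents(table: Table) -> Table:
    """Toy transform: return new rows with lower-cased contents."""
    return [{**row, "contents": str(row["contents"]).lower()} for row in table]


inputs = {
    "part-0.parquet": [{"document_id": 1, "contents": "Hello World"}],
    "part-1.parquet": [{"document_id": 2, "contents": "DATA Prep"}],
}
outputs = run_transform(lowercase_contents, inputs)
```

Because the transform returns new rows rather than mutating its input, the original tables stay available for comparison after the run.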

Transform Design

A transform can follow one of two patterns: annotator or filter.

  • Annotator: An annotator transform adds information during processing by adding one or more columns to the parquet files. The annotator design also allows a user to verify the results of the processing before the data is actually filtered.

  • Filter: A filter transform processes the data and outputs the transformed data, e.g., exact deduplication. A general-purpose SQL-based filter transform provides a powerful mechanism for identifying columns and rows of interest for downstream processing.

When adding a new module, a user can pick the right design based on the processing to be applied. More details here.
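
The two patterns can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library, not the toolkit's own implementation: lists of dicts stand in for parquet tables, an in-memory SQLite database stands in for whatever SQL engine the real filter transform uses, and the names `annotate_length`, `sql_filter`, and `contents_length` are hypothetical.

```python
import sqlite3

rows = [
    {"document_id": 1, "contents": "short"},
    {"document_id": 2, "contents": "a much longer document body"},
    {"document_id": 3, "contents": "medium length text"},
]


def annotate_length(table):
    """Annotator: add a `contents_length` column and keep every row."""
    return [{**row, "contents_length": len(row["contents"])} for row in table]


def sql_filter(table, where_clause):
    """Filter: keep only the rows that match a SQL WHERE clause."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE docs "
        "(document_id INTEGER, contents TEXT, contents_length INTEGER)"
    )
    conn.executemany(
        "INSERT INTO docs VALUES (:document_id, :contents, :contents_length)",
        table,
    )
    # The clause is interpolated directly, so only pass trusted input here.
    query = f"SELECT * FROM docs WHERE {where_clause} ORDER BY document_id"
    kept = [dict(r) for r in conn.execute(query)]
    conn.close()
    return kept


annotated = annotate_length(rows)  # still 3 rows, with one extra column
filtered = sql_filter(annotated, "contents_length >= 10")
print([r["document_id"] for r in filtered])  # prints [2, 3]
```

Running the annotator first leaves every row in place, so the new column can be inspected before any row is dropped by the filter.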

Scaling of Transforms

To process large data volumes on multi-node clusters, Ray or Spark wrappers are provided to readily scale out the Python implementations. A generalized workflow is shown here.
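
The idea behind the wrappers is that the per-file Python logic stays unchanged while a runtime fans it out over many files. The sketch below shows that shape under a loud assumption: a standard-library thread pool stands in for the Ray or Spark wrapper, so it illustrates the pattern only and none of the toolkit's actual runtime APIs. The names `count_words` and `scale_out` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(table):
    """Plain-Python per-file transform: annotate each row with a word count."""
    return [{**row, "word_count": len(row["contents"].split())} for row in table]


def scale_out(transform, files, max_workers=4):
    """Run the unchanged transform on each file in parallel; return results by name."""
    names = list(files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so names and results line up.
        results = pool.map(transform, (files[n] for n in names))
        return dict(zip(names, results))


# Eight small stand-in "files"; file i holds one row with i + 1 words.
files = {
    f"part-{i}.parquet": [{"contents": "word " * (i + 1)}] for i in range(8)
}
scaled = scale_out(count_words, files)
```

The transform itself knows nothing about parallelism, which is what lets the same function run on a laptop or be handed to a distributed runtime.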

Bring Your Own Transform

One can add new transforms by bringing in Python-based processing logic and using the Data Processing Library to build and contribute transforms. We have provided an example transform that can serve as a template to add new simple transforms.
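
To give a feel for the template approach, here is a minimal sketch of a no-op transform and a custom transform derived from the same base. The class names (`SimpleTransform`, `NoopTransform`, `RedactTransform`), the `transform` method signature, and the `word` configuration key are all hypothetical and chosen for illustration; consult the example transform in the repository for the real interface.

```python
from abc import ABC, abstractmethod


class SimpleTransform(ABC):
    """Hypothetical base class: one table in, one table out."""

    def __init__(self, config=None):
        self.config = dict(config or {})

    @abstractmethod
    def transform(self, table):
        """Return a new table derived from the input table."""


class NoopTransform(SimpleTransform):
    """Template: returns the rows unchanged. Copy it and edit `transform`."""

    def transform(self, table):
        return list(table)


class RedactTransform(SimpleTransform):
    """Custom logic: replace a configured word in the `contents` column."""

    def transform(self, table):
        word = self.config.get("word", "secret")
        return [
            {**row, "contents": row["contents"].replace(word, "[REDACTED]")}
            for row in table
        ]


table = [{"contents": "the secret plan"}, {"contents": "nothing here"}]
unchanged = NoopTransform().transform(table)
redacted = RedactTransform({"word": "secret"}).transform(table)
```

Starting from a no-op keeps the first version of a new transform trivially testable: the output must equal the input until custom logic is added.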

More details on the data processing library are here.

Automation

The toolkit also supports automated transform execution based on Kubeflow Pipelines (KFP), tested on a locally deployed Kind cluster and on external OpenShift clusters. Automation is provided to create a Kind cluster and deploy all required components on it. The KFP implementation is based on the KubeRay Operator for creating and managing the Ray cluster, and on the KubeRay API server for interacting with the KubeRay Operator. An additional framework, along with several KFP components, is used to simplify the pipeline implementation.

A simple transform pipeline tutorial explains pipeline creation and execution. In addition, if you want to combine several transforms in a single pipeline, you can look at the multi-steps pipeline.

When you finish working with the cluster and want to clean up or destroy it, see the cluster clean-up instructions.

Acknowledgements

Thanks to the BigCode Project, from which a few code quality metrics were borrowed.
