
Releases: Gavin-Development/GavinBackendDatasetUtils

WordPiece

17 Jan 18:25

New Features

  • Completely reworked Tokenizer, now based on WordPiece
  • More Pythonic interface for Tokenizer
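
For readers unfamiliar with WordPiece, the following is a minimal sketch of the usual greedy longest-match lookup that splits a word into vocabulary pieces. It is not this module's Tokenizer implementation or interface: the function name `wordpiece_tokenize`, the `##` continuation prefix, the `[UNK]` token and the toy vocabulary are all assumptions made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces.

    Hypothetical helper for illustration only; not the module's Tokenizer.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix (assumed convention).
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
vocab = {"token", "##izer", "##s"}
print(wordpiece_tokenize("tokenizers", vocab))
```

With this toy vocabulary the word is split into a leading piece followed by continuation pieces; a word that cannot be covered by the vocabulary collapses to the unknown token.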

Important Info

  • This build is considered stable enough for a minor release, but it is the first implementation of the WordPiece algorithm. It won't be fast and is probably not efficient, but it should get the job done with little issue.
  • GPU acceleration has been temporarily removed from the Tokenizer in this release due to the ground-up rework.
  • This build is Windows only and requires Intel Python 3.9.15 or newer.

Latest Release

07 Jan 14:09

Quick release with the latest binaries, x64 only.
Windows: .pyd File
Linux: .so File
Ensure the directory containing these libraries is on your PYTHONPATH so that Python can load them.
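
If setting PYTHONPATH in the environment is inconvenient, the same effect can be had at runtime by adding the directory to `sys.path`. The sketch below is one way to do that and to check whether a module can then be found. The directory `./gavin_binaries` and the module name `GavinBackendDatasetUtils` are assumptions (the latter guessed from the repository name), and the helper names are hypothetical.

```python
import importlib.util
import sys
from pathlib import Path


def add_binary_dir(directory):
    """Put `directory` on sys.path (if missing) so extension modules in it can be found."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


def is_importable(module_name):
    """Return True if Python can locate `module_name` on the current search path."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical folder holding the downloaded .pyd / .so file.
binary_dir = add_binary_dir("./gavin_binaries")

# Module name assumed from the repository name; adjust to the actual file name.
print(is_importable("GavinBackendDatasetUtils"))
```

This only checks that the file can be located; importing it still requires a Python version and platform that match the binary.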

A second Release it seems...

28 Jul 14:58
Pre-release

Gavin Backend Dataset Utils Release 28/07/2022 Intel LLVM compiler SYCL / CUDA support.

New Features

  • BIN file class with methods for creating, modifying & reading BIN files.
  • Tokenizer class to create, build, save, load and use the GPU-accelerated BPE tokenization algorithm.
  • Some performance tweaks?
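
As background on BPE, a single training step counts adjacent symbol pairs across the corpus and merges the most frequent pair into a new symbol. The sketch below shows that step in plain, CPU-only Python. It is not the module's GPU-accelerated Tokenizer, and the function names and toy corpus are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs over all sequences and return the most common one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with a single merged symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# Toy corpus, split into single characters (invented for this example).
corpus = [list("abab"), list("abc")]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(seq, pair) for seq in corpus]
print(pair, corpus)
```

Repeating this step until a target vocabulary size is reached yields the merge table that a BPE tokenizer later applies to new text.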

Important Info

This build was built with the Intel LLVM compiler, which lets us build the module with support for Intel GPUs & Nvidia GPUs via SYCL for GPU-accelerated functions. Be warned: this will have adverse effects on performance on AMD systems, as the Intel compiler deliberately produces subpar code for AMD systems.

NOTE: If you choose to use the new CUDA version, you will need the DLLs included in the zip file. If you choose to stick with the SYCL version, you will need the Intel oneAPI Base Toolkit installed and will need to use Intel Python.

A release? I guess so...

04 Feb 19:03

Gavin Backend Dataset Utils Release 04/02/2022 VS 2022 preview build

Features

  • BIN file format for storing tokenized data
  • Multithreaded loading of BIN
  • Single-threaded loading of BIN
  • Transcoding of old file format to BIN
  • Data generator to stream data into RAM
  • Old file type legacy support

Important info

This build was built using MSVC with C++17 on VS 2022 preview. It has passed basic testing and is ready for use. It also contains incomplete features which, if used, could result in instability. Refer to the included README to see what is and is not available in this build: README.md