Skip to content

Latest commit

 

History

History
20 lines (14 loc) · 2.21 KB

microsoft-ml-tokenizers-migration-guide.md

File metadata and controls

20 lines (14 loc) · 2.21 KB

Porting to Microsoft.ML.Tokenizers

This guide provides general guidance on how to migrate from various tokenizer libraries to Microsoft.ML.Tokenizers for Tiktoken.

Microsoft.DeepDev.TokenizerLib

API Guidance

Microsoft.DeepDev.TokenizerLib Microsoft.ML.Tokenizers
TikTokenizer Tokenizer
ITokenizer Tokenizer
TokenizerBuilder TiktokenTokenizer.CreateForModel
TiktokenTokenizer.CreateForModel(Async/Stream) user provided file stream

General Guidance

  • To avoid embedding the tokenizer's vocabulary files in the code assembly or downloading them at runtime when using one of the standard Tiktoken vocabulary files, utilize the TiktokenTokenizer.CreateForModel function. The table lists the mapping of model names to the corresponding vocabulary files used with each model. This table offers clarity regarding the vocabulary file linked with each model, alleviating users from the concern of carrying or downloading such vocabulary files if they utilize one of the models listed.
  • Avoid hard-coding tiktoken regexes and special tokens. Instead use the appropriate Tiktoken.TiktokenTokenizer.CreateForModel/Async method to create the tokenizer using the model name, or a provided stream.
  • Avoid doing encoding if you need the token count or encoded Ids. Instead use TiktokenTokenizer.CountTokens for getting the token count and TiktokenTokenizer.EncodeToIds for getting the encode ids.
  • Avoid doing encoding if all you need is to truncate to a token budget. Instead use TiktokenTokenizer.GetIndexByTokenCount or GetIndexByTokenCountFromEnd to find the index to truncate from the start or end of a string, respectively.