.Net: Add BERT ONNX embedding generation service #5518
This looks great! I'd approve it but am probably not the relevant person to do so in this repo.
Nice addition @stephentoub. Looks great.
Is this the final version of the connector, or are we going to have more PRs? If the latter, I suggest creating a dedicated feature branch, following our current practice for new connectors.
There's nothing further I plan to add right now. When https://github.com/microsoft/onnxruntime-genai/ is further along, we'll want to add support for chat completion based on using ONNX with models like Phi2, Mistral, LLaMa, Gemma, etc. That's part of why I made the connector generally about working with ONNX models, and for now the only service it provides is one for BERT embeddings. Hopefully this can evolve to provide broad support for much more.
Incurs a 500MB download and doesn't add anything meaningful over the other tests. Also make downloading a bit more reliable, and add a couple of test cases.
26e6625 to e9e760e
Does anyone here happen to know when this will make it to NuGet.org? I'm wondering whether to re-plat SmartComponents.LocalEmbeddings on top of it, and whether I can use SK in a demo of embeddings next week or whether I need to stick with SmartComponents.LocalEmbeddings only for now.
Regarding:
Hey @stephentoub, maintainer of FastBertTokenizer here. I'd be happy to help/contribute in that direction and would also be willing to move FastBertTokenizer in a direction that makes its use viable here. Is the main concern the lack of I already considered supporting older .NET versions (georg-jung/FastBertTokenizer#23). However, because they lack Any suggestions on how to move this forward?
Same "happy to help/contribute" of course also applies to what's needed for |
Hi @georg-jung, Thanks for the offer! Currently we're working on consolidating efforts in tokenizers around Microsoft.ML.Tokenizers. The goal is to provide an extensible library that serves as a one-stop-shop for tokenizer needs. So far, we've introduced tokenizers like:
BERT Tokenizers are part of the roadmap. Would you be up for making a contribution to Microsoft.ML.Tokenizers for the BERT tokenizers? |
Sorry for the delay in responding over the holidays. I see. I think I'll need to take a more in-depth look at the current API surface and get an impression of the current design before I can say something more substantial. A one-stop-shop tokenizer library for .NET would of course be great, and I'm happy to help with that if it makes sense! Given the broad API surface, my gut feeling is that a BERT implementation in Tokenizers would essentially be a complete rewrite vs. FastBertTokenizer. Some first thoughts from my experience creating FastBertTokenizer:
No worries. Thanks for the response @georg-jung. Tagging @tarekgh who can better comment.
Yes, we can help with that as needed.
We build for netstandard 2.0. If you can tell us what functionality you need there, I can try to provide a similar implementation for down-level targets, and we can use Rune when targeting .NET.
I am not sure if you are looking at the latest code. We now have the tokenizer model working with spans, like https://github.com/dotnet/machinelearning/blob/0fd58cbfb613113e920977b6891c05fd949486d8/src/Microsoft.ML.Tokenizers/Model/Model.cs#L35. If FastBertTokenizer uses regex-based pre-tokenization, it will be difficult to avoid creating strings, as regex on down-level targets always creates strings. If you can elaborate on where you are seeing strings being created, I can try to look at that.
One typical preprocessing step for BERT encoding is removing control chars etc. from the input text. FastBertTokenizer therefore enumerates over the runes and decides, based on Rune.GetUnicodeCategory, whether the thing we look at should be removed. Char.GetUnicodeCategory would work in most cases too. Input text might contain Unicode surrogate pairs though, and Char.GetUnicodeCategory wouldn't return the correct category for them. The code points they represent would thus be removed even though they could be correctly encoded in many cases. Example:
I hope I did. Specifically I'm thinking of
Edit: Oh, sorry, right after posting I noticed that Split won't allocate a string instance if just the
It doesn't use regex. FastBertTokenizer uses a ref struct enumerator over the original input string that enumerates ReadOnlySpan<char>. Its pre-tokenization should be allocation-free (if lowercasing isn't required and if the vocabulary doesn't require normalization to FormC or the input is already in FormC; if lowercasing is required it will only lowercase the current token in a small buffer before yielding, which would be almost allocation-free). For corresponding code see https://github.com/georg-jung/FastBertTokenizer/blob/cf29d5adc2b671694fb873335741334045560261/src/FastBertTokenizer/PreTokenizingEnumerator.cs Also note that whitespace isn't "enough" for tokenization. I think e.g. Chinese characters tend not to be separated by whitespace but should still be split. I'm not sure if regexes could be used to correctly tokenize them, but e.g. deciding based on the Unicode category wouldn't be sufficient here. See https://github.com/georg-jung/FastBertTokenizer/blob/cf29d5adc2b671694fb873335741334045560261/src/FastBertTokenizer/PreTokenizingEnumerator.cs#L169 Most of the string allocations FastBertTokenizer does at all are required for dict lookups and, if required, Unicode normalization, because these are scenarios where, I think, the runtime doesn't yet support ReadOnlySpan<char>. If the dict lookups arrive with .NET 9 that would probably have a quite positive impact. dotnet/runtime#27229 dotnet/runtime#87757
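To illustrate the ref struct enumerator idea described above, here is a minimal sketch. It is not FastBertTokenizer's actual implementation: the type name SimplePreTokenizer is hypothetical, and it only splits on whitespace and punctuation. It deliberately ignores Chinese characters, surrogate pairs, lowercasing, and normalization, all of which the real enumerator has to handle.

```csharp
using System;

// Hypothetical, simplified pre-tokenizer: yields slices of the input
// without allocating strings. Whitespace separates tokens and each
// punctuation char becomes a token of its own.
public ref struct SimplePreTokenizer
{
    private ReadOnlySpan<char> _remaining;

    public SimplePreTokenizer(ReadOnlySpan<char> input)
    {
        _remaining = input;
        Current = default;
    }

    public ReadOnlySpan<char> Current { get; private set; }

    // Enables pattern-based foreach over the struct itself.
    public SimplePreTokenizer GetEnumerator() => this;

    public bool MoveNext()
    {
        // Skip leading whitespace.
        int start = 0;
        while (start < _remaining.Length && char.IsWhiteSpace(_remaining[start]))
        {
            start++;
        }

        if (start == _remaining.Length)
        {
            _remaining = default;
            return false;
        }

        int end = start + 1;
        if (!char.IsPunctuation(_remaining[start]))
        {
            // Extend the token until whitespace or punctuation.
            while (end < _remaining.Length
                && !char.IsWhiteSpace(_remaining[end])
                && !char.IsPunctuation(_remaining[end]))
            {
                end++;
            }
        }

        Current = _remaining.Slice(start, end - start);
        _remaining = _remaining.Slice(end);
        return true;
    }
}
```

Usage would look like `foreach (ReadOnlySpan<char> token in new SimplePreTokenizer("Hello, world!")) { ... }`, where each token is a view into the original input rather than a new string.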
Thanks for the details @georg-jung.
You may consider using https://learn.microsoft.com/en-us/dotnet/api/system.globalization.charunicodeinfo.getunicodecategory?view=net-8.0#system-globalization-charunicodeinfo-getunicodecategory(system-string-system-int32) which should work with Surrogate too.
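As a small sketch of the difference between the three APIs mentioned in this thread: the class and method names below (CategoryDemo, ByChar, ByStringIndex, ByRune) are made up for illustration, while the framework calls are the ones named in the discussion. Rune requires .NET Core 3.0 or later, so ByRune would not compile for netstandard 2.0.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CategoryDemo
{
    // 'a' followed by U+1F600, which UTF-16 encodes as a surrogate pair.
    public const string Sample = "a\U0001F600";

    // Looks at a single UTF-16 code unit only.
    public static UnicodeCategory ByChar(string s, int index) =>
        char.GetUnicodeCategory(s[index]);

    // Looks at the code point starting at index, combining a valid
    // surrogate pair into one code point first.
    public static UnicodeCategory ByStringIndex(string s, int index) =>
        CharUnicodeInfo.GetUnicodeCategory(s, index);

    // Decodes a Rune (Unicode scalar value) at index and asks for its category.
    public static UnicodeCategory ByRune(string s, int index) =>
        Rune.GetUnicodeCategory(Rune.GetRuneAt(s, index));
}
```

For index 1 of Sample, ByChar reports UnicodeCategory.Surrogate, while ByStringIndex and ByRune report the category of the code point itself (OtherSymbol for U+1F600).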
In our library, we have a function called Tokenizer.Encode(...). When you use this function, it returns comprehensive encoding data. This data includes the following components:
We allocate memory for normalization for reasons similar to Unicode normalization. Our interfaces are designed to be generic, accommodating any tokenizer and any scenario. We’ve observed scenarios that require text manipulation, for instance removing specific characters, replacing them, or adding new ones. Additionally, our APIs return offsets or indexes as part of the encoding process, and these are relative to the normalized string.
I’m pretty sure that regular expressions (regex) can address this issue.
In our tokenizers, we’ve introduced a solution that enables us to search the dictionary using spans. You can find the relevant type in the following location: StringSpanOrdinalKey.cs. Additionally, take a look at how this type is utilized in examples like Tiktoken.cs. As for dotnet/runtime#87757, even providing that will not be enough for our scenarios, as we need to support down-level targets such as netstandard 2.0 and even .NET 8. SentencePiece tokenizers actually include the normalization data inside the tokenizer data, which makes them self-contained. But requiring every tokenizer to carry such non-trivial data would be too much.
I think I originally ruled that out as it would require a string allocation for every category check. Thinking about this again, it might actually be a great fallback. On modern .NET we could use non-allocating Rune, on older .NET Char and only if we detect a surrogate use this API to get the category. Thanks!
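A rough sketch of the down-level half of that fallback, under the assumption that the input is only available as a span: the helper name DownLevelUnicode.GetCategory is hypothetical. It uses the cheap per-char check in the common case and allocates a two-char string for the string-based API only when a valid surrogate pair is found. On modern .NET, the non-allocating Rune path would be used instead.

```csharp
using System;
using System.Globalization;

public static class DownLevelUnicode
{
    // Returns the Unicode category of the code point starting at index and
    // reports how many UTF-16 code units it spans (1 or 2).
    public static UnicodeCategory GetCategory(
        ReadOnlySpan<char> text, int index, out int charsConsumed)
    {
        char c = text[index];

        if (char.IsHighSurrogate(c)
            && index + 1 < text.Length
            && char.IsLowSurrogate(text[index + 1]))
        {
            // Rare path: allocate a short string so the string/index
            // overload can combine the surrogate pair.
            charsConsumed = 2;
            string pair = text.Slice(index, 2).ToString();
            return CharUnicodeInfo.GetUnicodeCategory(pair, 0);
        }

        // Common path: no allocation. Lone surrogates end up here and
        // are reported as UnicodeCategory.Surrogate.
        charsConsumed = 1;
        return char.GetUnicodeCategory(c);
    }
}
```

With this shape the allocation cost is paid only for text that actually contains supplementary-plane characters, which is the trade-off the comment above describes.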
I think unicode normalization, e.g. FormC and FormD, is a common requirement for BERT. I did a quick check on some random vocabs on HuggingFace and Vocabs like And even for e.g. As a side note, it should be possible to check if a vocabulary has a need for unicode normalization when loading it, so that we could do the right thing without explicit configuration.
But if that arrives it would be possible to multi-target and have the perf wins on modern .NET 9/... and use allocating behaviour on older .NET, no?
Thanks, this looks really interesting!
FWIW, I did a quick benchmark of the different pretokenization approaches. Note that it doesn't compare exactly the same thing, as the regex methods just create splits at whitespace if I understand correctly, while the ref struct enumerator took more options into account (punctuation + Chinese chars). The corpus I tested with is some thousands of articles from Simple English Wikipedia. It contains Unicode surrogate pairs, Chinese, ....
// * Summary *
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3374/23H2/2023Update/SunValley3)
AMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.202
[Host] : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
.NET 6.0 : .NET 6.0.28 (6.0.2824.12007), X64 RyuJIT AVX2
.NET 8.0 : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
.NET Framework 4.8.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
| Method | Job | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Allocated | Alloc Ratio |
|------------------------ |--------------------- |--------------------- |-----------:|---------:|---------:|------:|--------:|------------:|------------:|-------------:|-------------:|
| RefStructEnumerator | .NET 6.0 | .NET 6.0 | 240.9 ms | 1.35 ms | 1.05 ms | 1.00 | 0.00 | - | - | 4437 B | 1.00 |
| RegexPublicMlNetNuget | .NET 6.0 | .NET 6.0 | 1,647.4 ms | 29.19 ms | 25.88 ms | 6.85 | 0.11 | 478000.0000 | 135000.0000 | 1700021840 B | 383,146.68 |
| RegexCurrentMlNetGithub | .NET 6.0 | .NET 6.0 | 685.6 ms | 11.07 ms | 9.81 ms | 2.84 | 0.04 | 528000.0000 | - | 1104531776 B | 248,936.62 |
| | | | | | | | | | | | |
| RefStructEnumerator | .NET 8.0 | .NET 8.0 | 173.8 ms | 3.47 ms | 5.99 ms | 1.00 | 0.00 | - | - | 256 B | 1.00 |
| RegexPublicMlNetNuget | .NET 8.0 | .NET 8.0 | 1,435.1 ms | 28.11 ms | 27.60 ms | 8.36 | 0.57 | 491000.0000 | 148000.0000 | 1700009576 B | 6,640,662.41 |
| RegexCurrentMlNetGithub | .NET 8.0 | .NET 8.0 | 313.1 ms | 1.52 ms | 1.27 ms | 1.83 | 0.13 | 500.0000 | - | 1320536 B | 5,158.34 |
| | | | | | | | | | | | |
| RefStructEnumerator | .NET Framework 4.8.1 | .NET Framework 4.8.1 | 447.5 ms | 3.46 ms | 3.24 ms | 1.00 | 0.00 | - | - | - | NA |
| RegexPublicMlNetNuget | .NET Framework 4.8.1 | .NET Framework 4.8.1 | 3,232.1 ms | 16.89 ms | 14.97 ms | 7.23 | 0.06 | 471000.0000 | 126000.0000 | 1731811528 B | NA |
| RegexCurrentMlNetGithub | .NET Framework 4.8.1 | .NET Framework 4.8.1 | 2,200.3 ms | 24.80 ms | 23.20 ms | 4.92 | 0.06 | 528000.0000 | - | 1107784272 B | NA |
// * Hints *
Outliers
Pretokenization.RefStructEnumerator: .NET 6.0 -> 3 outliers were removed (247.29 ms..252.96 ms)
Pretokenization.RegexPublicMlNetNuget: .NET 6.0 -> 1 outlier was removed (1.80 s)
Pretokenization.RegexCurrentMlNetGithub: .NET 6.0 -> 1 outlier was removed (722.77 ms)
Pretokenization.RefStructEnumerator: .NET 8.0 -> 4 outliers were removed, 5 outliers were detected (138.11 ms, 177.90 ms..178.81 ms)
Pretokenization.RegexCurrentMlNetGithub: .NET 8.0 -> 2 outliers were removed, 3 outliers were detected (309.58 ms, 316.50 ms, 321.06 ms)
Pretokenization.RegexPublicMlNetNuget: .NET Framework 4.8.1 -> 1 outlier was removed (3.43 s)
Always impressive how much difference just using recent .NET makes :) - great work there! Code is available here: https://github.com/georg-jung/TokenizerBenchmarks
Thanks again @tarekgh for the suggestions and thoughts! By incorporating them I was able to add Thus, adding @SteveSandersonMS The API you use in |
Adds a new Microsoft.SemanticKernel.Connectors.Onnx component. As of this PR, it contains one service, BertOnnxTextEmbeddingGenerationService, for using BERT-based ONNX models to generate embeddings. But in time we can add more ONNX-based implementations for using local models. This is in part based on https://onnxruntime.ai/docs/tutorials/csharp/bert-nlp-csharp-console-app.html and https://github.com/dotnet-smartcomponents/smartcomponents. It doesn't support everything that's supported via sentence-transformers, but we should be able to extend it as needed. cc: @luisquintanilla, @SteveSandersonMS, @JakeRadMSFT