Additional support for T5 Tokenizer - SentencepieceTokenizer #828
Comments
Can you provide an example of an inconsistent result compared with the HF tokenizer here for investigation? And when you say 'sentinel tokens', do you mean something like '<extra_id_0>...'? I think some of these tokens are handled by the HF Python code, while the others are processed by the SentencePiece library. May I ask how these tokens are used in your app?
Hi @wenbingl ,
Yes, when I mentioned sentinel tokens, they are the tokens from
I am attaching the Python script I used to convert and test the tokenizer here:
import numpy as np
from transformers import T5TokenizerFast
from onnxruntime_extensions import gen_processing_models, get_library_path
import onnxruntime as ort
# Initialize the tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
text = "<extra_id_0> am looking foward to hearing from you."
input_ids = tokenizer.encode(text, return_tensors="np")
# Create the ONNX graphs for the tokenizer
# ort_tokenizer - Model to perform tokenization from string input to tokenized output
# ort_decoder - Model to perform decoding from tokenized input to string output
ort_tokenizer, _ = gen_processing_models(tokenizer, pre_kwargs={'CAST_TOKEN_ID': True})
# Save the ONNX graphs
with open("tokenizer.onnx", "wb") as f:
    f.write(ort_tokenizer.SerializeToString())
# Run inference with the ONNX models
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("tokenizer.onnx", sess_options=session_options)
# Tokenize the input
actual_ids = tokenizer_session.run(None, {'inputs':[text]})[0]
print("HuggingFace Tokenizer IDs:", input_ids[0])
print("OnnxRuntime Tokenizer IDs:", actual_ids)
And I got the following result while running the above script:
HuggingFace Tokenizer IDs: [32099 183 479 5575 2239 12 3507 45 25 5 1]
OnnxRuntime Tokenizer IDs: [ 3 2 25666 834 23 26 834 632 3155 183 479 5575
2239 12 3507 45 25 5 1]
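To localize the mismatch, the two ID lists above can be compared from the end. The following is a minimal sketch that only uses the IDs printed above; the helper name `common_suffix_len` is hypothetical and not part of either library.

```python
def common_suffix_len(a, b):
    """Return how many trailing elements the two sequences share."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# IDs copied from the output reported above.
hf_ids = [32099, 183, 479, 5575, 2239, 12, 3507, 45, 25, 5, 1]
ort_ids = [3, 2, 25666, 834, 23, 26, 834, 632, 3155,
           183, 479, 5575, 2239, 12, 3507, 45, 25, 5, 1]

shared = common_suffix_len(hf_ids, ort_ids)
hf_prefix = hf_ids[:len(hf_ids) - shared]
ort_prefix = ort_ids[:len(ort_ids) - shared]

print("Shared trailing IDs:", shared)
print("HuggingFace prefix:", hf_prefix)
print("OnnxRuntime prefix:", ort_prefix)
```

Everything after the sentinel matches. The only difference is at the start, where HuggingFace emits a single ID (32099) and the ONNX tokenizer emits nine IDs, which suggests the sentinel string is being split into ordinary pieces rather than matched as one token.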
Thanks, we will take a look at the issue.
Hi @wenbingl,
Is it possible to call ort-extensions via C API, like the following:
Hi @wenbingl,
@r4ghu You might want to add the sentinel tokens manually to the protobuf file. See my comment here #852 (comment).
Hi team,
I would like to request support for additional features in T5Tokenizer / SentencepieceTokenizer. I was able to convert the HuggingFace T5 Tokenizer to ONNX format using the following code -
So far, the tokenizer works without issues when I pass normal sentences. But when I add sentinel tokens to my input sentence, its behavior differs from the HuggingFace tokenizer. Can you please add support for sentinel tokens in SentencepieceTokenizer? If this can already be achieved with a workaround using the existing logic, I would like to know, as it would simplify the preprocessing I currently need to handle sentinel tokens.
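For reference, the kind of preprocessing workaround mentioned above could look roughly like the sketch below. It is not an onnxruntime-extensions feature: `encode_with_sentinels` and `encode_fn` are hypothetical names, and the mapping `<extra_id_N>` to ID `32099 - N` is an assumption (this thread only confirms `<extra_id_0>` = 32099 for t5-small). `encode_fn` is assumed to return IDs for plain text without a trailing end-of-sequence ID.

```python
import re

# Matches T5-style sentinel tokens such as <extra_id_0>.
SENTINEL_RE = re.compile(r"<extra_id_(\d+)>")


def encode_with_sentinels(text, encode_fn, sentinel_zero_id=32099, eos_id=1):
    """Encode text, mapping sentinels to IDs directly (assumed mapping).

    encode_fn: hypothetical callable that turns plain text into a list of
    IDs WITHOUT a trailing EOS ID.
    """
    ids = []
    pos = 0
    for match in SENTINEL_RE.finditer(text):
        segment = text[pos:match.start()]
        if segment.strip():
            ids.extend(encode_fn(segment))
        # Assumption: <extra_id_N> maps to sentinel_zero_id - N.
        ids.append(sentinel_zero_id - int(match.group(1)))
        pos = match.end()
    tail = text[pos:]
    if tail.strip():
        ids.extend(encode_fn(tail))
    ids.append(eos_id)  # single EOS at the very end
    return ids


# Toy stand-in for a real tokenizer: one ID per word (its length).
def toy_encode(segment):
    return [len(word) for word in segment.split()]


print(encode_with_sentinels("<extra_id_0> am looking", toy_encode))
```

With a real ONNX tokenizer session behind `encode_fn`, the trailing EOS ID would need to be stripped from each segment's output first, and whitespace handling around the sentinels may still differ from HuggingFace, so the result should be checked against the reference tokenizer.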