
Add notebook to demonstrate how efficiently running SetFit with ONNX #435

Merged
merged 4 commits into v1.0.0-pre from moshe
Nov 27, 2023

Conversation

MosheWasserb
Collaborator

Efficiently run SetFit Models with Optimum


@tomaarsen
Member

Hello!

I've tried to reproduce your findings, but I get somewhat different results. In particular, for me distilbert and bge-small have roughly the same latency, e.g. here:
[screenshot: latency benchmark comparing distilbert and bge-small]

Beyond that, I can't export the model to ONNX without concerning warnings:

Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 1.13.1+cu117
Overriding 1 configuration item(s)
        - use_cache -> False
2023-11-21 11:22:39.1526897 [W:onnxruntime:, session_state.cc:1162 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-11-21 11:22:39.1569189 [W:onnxruntime:, session_state.cc:1164 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Overridding for_gpu=False to for_gpu=True as half precision is available only on GPU.
C:\Users\tom\.conda\envs\setfit\lib\site-packages\optimum\onnxruntime\configuration.py:770: FutureWarning: disable_embed_layer_norm will be deprecated soon, use disable_embed_layer_norm_fusion instead, disable_embed_layer_norm_fusion is set to True.
  warnings.warn(
Optimizing model...
2023-11-21 11:22:40.3541080 [W:onnxruntime:, session_state.cc:1162 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-11-21 11:22:40.3587720 [W:onnxruntime:, session_state.cc:1164 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
symbolic shape inference disabled or failed.
symbolic shape inference disabled or failed.
Configuration saved in bge_auto_opt_O4\ort_config.json
Optimized model saved at: bge_auto_opt_O4 (external data format: False; saved all tensor to one file: True)
Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating models in subprocesses...
Validating ONNX model bge_auto_opt_O4/model.onnx...
2023-11-21 11:22:43.1815932 [W:onnxruntime:, session_state.cc:1162 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-11-21 11:22:43.1859957 [W:onnxruntime:, session_state.cc:1164 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
        -[✓] ONNX model output names match reference model (last_hidden_state)
        - Validating ONNX Model output "last_hidden_state":
                -[✓] (2, 16, 384) matches (2, 16, 384)
                -[x] values not close enough, max diff: 2.19140625 (atol: 0.0001)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 0.0001:
- last_hidden_state: max diff = 2.19140625.
 The exported model was saved at: bge_auto_opt_O4
-[x] values not close enough, max diff: 2.19140625 (atol: 0.0001)
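
For reference, an export and optimization along these lines produces output like the above — a minimal sketch, assuming a bge-small body and Optimum's ONNX Runtime integration (the checkpoint name and save directory are stand-ins):

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Export the SetFit body (a Sentence Transformers checkpoint) to ONNX.
ort_model = ORTModelForFeatureExtraction.from_pretrained(
    "BAAI/bge-small-en-v1.5", export=True
)

# O4 applies graph fusions plus half precision, hence the "for_gpu" override above.
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(
    save_dir="bge_auto_opt_O4",
    optimization_config=AutoOptimizationConfig.O4(),
)
```

Since O4 converts the weights to fp16, a large max diff against the fp32 reference at atol 0.0001 is plausibly expected rather than a sign of a broken export.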

However, the model strangely still performs well, i.e. it still reaches 90.6% accuracy.
I can only run the exported model on CUDA, not on TensorRT. For TensorRT, the model gets stuck for more than an hour at 100% GPU usage. When running on CUDA, the latency goes from ~11ms to ~3.5ms, which is still a big improvement, but not quite as big as what you see with TensorRT:
[screenshot: latency benchmark with the CUDA execution provider]
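
For completeness, switching between the two execution providers is just a matter of the `provider` argument when loading the optimized model — a sketch, assuming the `bge_auto_opt_O4` directory from the export step:

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction

# CUDA execution provider: works for me, roughly a 3x latency improvement.
model = ORTModelForFeatureExtraction.from_pretrained(
    "bge_auto_opt_O4", provider="CUDAExecutionProvider"
)

# TensorRT execution provider: builds an engine on first use, which can
# take a very long time and is where my runs get stuck.
# model = ORTModelForFeatureExtraction.from_pretrained(
#     "bge_auto_opt_O4", provider="TensorrtExecutionProvider"
# )
```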

I'm okay with just fixing some typos and merging this PR, in the hope that others don't run into the same issues. Perhaps at some point I can implement proper ONNX support for SentenceTransformers and SetFit so it works a bit more consistently.

Did you also get the -[x] values not close enough, max diff: 2.19140625 (atol: 0.0001) warning? And did you encounter issues with running TensorRT? And should I just get this PR merged with some typo fixes? I'm curious about your thoughts.

  • Tom Aarsen

@MosheWasserb
Collaborator Author

MosheWasserb commented Nov 22, 2023

Thanks @tomaarsen
I noticed a high variance when calculating the average latency for DistilBERT. I'm not sure what the reason is when running with Google Colab.
If you get a small variance in your experiments, it's best to stick with your results.
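
For what it's worth, averaging over many runs after a warmup (and reporting the standard deviation) helps separate real differences from Colab noise — a minimal sketch, assuming a loaded model with a `predict` method; the helper function is hypothetical:

```python
import time

import numpy as np

def measure_latency(predict_fn, text, warmup=10, runs=100):
    # Warm up first so one-time costs (CUDA init, engine builds) don't skew the stats.
    for _ in range(warmup):
        predict_fn(text)
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(text)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return np.mean(latencies_ms), np.std(latencies_ms)

mean_ms, std_ms = measure_latency(lambda t: model.predict([t]), "an example sentence")
print(f"latency: {mean_ms:.2f} ms +/- {std_ms:.2f} ms")
```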

I also see a similar warning when exporting to ONNX; it also appears in the original notebook shared by Ilyas Moutawwakil (our colleague from Hugging Face). See here: https://twitter.com/IlysMoutawwakil/status/1705215192425288017, and the original notebook: https://colab.research.google.com/drive/10UAtpz26Gv2LtamT8j33LmI5UFQFwF4T?usp=sharing
Maybe it's best to ask Ilyas or @philschmid about the warning's status and implications.

I see good performance when running CUDA (~2.5ms) and TensorRT (~2.3ms) on my Google Colab. I didn't see any issue with exporting to TensorRT.

Moshe

@tomaarsen
Member

I'll retry with TensorRT on Google Colab rather than locally. As for the warning, the model accuracy is still identical, so it seems like it's not something to be too concerned about. Thanks for the info!

  • Tom Aarsen

@tomaarsen
Member

@MosheWasserb

I've continued trying to get TensorRT installed, now on Google Colab. It ends up failing when building the engine.
Beyond that, I realised that the pipe model from transformers was loaded on the CPU, while the SetFit model was on CUDA, so the comparison was not fair.
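
For reference, placing the transformers pipe on the GPU is just the `device` argument — a sketch; the task and checkpoint here are stand-ins for whatever the notebook uses:

```python
from transformers import pipeline

# device=0 places the pipeline on the first CUDA device; the default (-1) is CPU,
# which is what made the earlier comparison unfair.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in checkpoint
    device=0,
)
```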

This is with CUDA rather than TensorRT, and with the pipe on the GPU instead:
[screenshot: latency benchmark with both models on the GPU]

Perhaps I could remove the reference model from the notebook and focus only on the gains from using ONNX? I suppose that a DistilBERT model of 268MB will still be faster than a 127MB BERT model. I'm curious to hear your thoughts.

  • Tom Aarsen

@MosheWasserb
Collaborator Author

@tomaarsen
I agree, it's best to remove DistilBERT from the notebook. Let's focus on the gain of using ONNX for SetFit.

@tomaarsen
Member

Will do. I will try to incorporate this into a how-to guide in the SetFit documentation for v1.0.0.

@tomaarsen tomaarsen changed the base branch from main to v1.0.0-pre November 27, 2023 13:19
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@tomaarsen tomaarsen changed the title Add notebook to demonstrate how efficiently running SetFit with TensorRT Add notebook to demonstrate how efficiently running SetFit with ONNX Nov 27, 2023
@tomaarsen tomaarsen mentioned this pull request Nov 27, 2023
@tomaarsen tomaarsen merged commit 193f83f into v1.0.0-pre Nov 27, 2023
18 checks passed
@tomaarsen tomaarsen deleted the moshe branch November 27, 2023 13:58
@tomaarsen
Member

Thanks for your work on this, Moshe!

@MosheWasserb
Collaborator Author

Thank you, Tom, for this great release!
