
Add notebook to demonstrate how efficiently running SetFit with ONNX #435

Merged
merged 4 commits into v1.0.0-pre from moshe
Nov 27, 2023

Conversation

MosheWasserb
Collaborator

Efficiently run SetFit Models with Optimum


@tomaarsen
Member

Hello!

I've tried to reproduce your findings, but I get somewhat different results. In particular, for me distilbert and bge-small have roughly the same latency, e.g. here:
[screenshot: latency benchmark comparing distilbert and bge-small]

Beyond that, I can't export the model to ONNX without concerning warnings:

Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 1.13.1+cu117
Overriding 1 configuration item(s)
        - use_cache -> False
2023-11-21 11:22:39.1526897 [W:onnxruntime:, session_state.cc:1162 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-11-21 11:22:39.1569189 [W:onnxruntime:, session_state.cc:1164 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Overridding for_gpu=False to for_gpu=True as half precision is available only on GPU.
C:\Users\tom\.conda\envs\setfit\lib\site-packages\optimum\onnxruntime\configuration.py:770: FutureWarning: disable_embed_layer_norm will be deprecated soon, use disable_embed_layer_norm_fusion instead, disable_embed_layer_norm_fusion is set to True.
  warnings.warn(
Optimizing model...
2023-11-21 11:22:40.3541080 [W:onnxruntime:, session_state.cc:1162 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-11-21 11:22:40.3587720 [W:onnxruntime:, session_state.cc:1164 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
symbolic shape inference disabled or failed.
symbolic shape inference disabled or failed.
Configuration saved in bge_auto_opt_O4\ort_config.json
Optimized model saved at: bge_auto_opt_O4 (external data format: False; saved all tensor to one file: True)
Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating models in subprocesses...
Validating ONNX model bge_auto_opt_O4/model.onnx...
2023-11-21 11:22:43.1815932 [W:onnxruntime:, session_state.cc:1162 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-11-21 11:22:43.1859957 [W:onnxruntime:, session_state.cc:1164 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
        -[✓] ONNX model output names match reference model (last_hidden_state)
        - Validating ONNX Model output "last_hidden_state":
                -[✓] (2, 16, 384) matches (2, 16, 384)
                -[x] values not close enough, max diff: 2.19140625 (atol: 0.0001)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 0.0001:
- last_hidden_state: max diff = 2.19140625.
 The exported model was saved at: bge_auto_opt_O4
-[x] values not close enough, max diff: 2.19140625 (atol: 0.0001)
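
For reference, an export and optimization along these lines produces output like the above — a minimal sketch, assuming a bge-small body and Optimum's ONNX Runtime integration (the checkpoint name and save directory are stand-ins):

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Export the SetFit body (a Sentence Transformers checkpoint) to ONNX.
ort_model = ORTModelForFeatureExtraction.from_pretrained(
    "BAAI/bge-small-en-v1.5", export=True
)

# O4 applies graph fusions plus half precision, hence the "for_gpu" override above.
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(
    save_dir="bge_auto_opt_O4",
    optimization_config=AutoOptimizationConfig.O4(),
)
```

Since O4 converts the weights to fp16, a large max diff against the fp32 reference at atol 0.0001 is plausibly expected rather than a sign of a broken export.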

However, the model strangely still performs well, i.e. it still reaches 90.6% accuracy.
I can only run the exported model on CUDA, not on TensorRT. For TensorRT, the model gets stuck for more than an hour at 100% GPU usage. When running on CUDA, the latency goes from ~11ms to ~3.5ms, which is still a big improvement, but not quite as big as what you see with TensorRT:
[screenshot: latency benchmark with the CUDA execution provider]
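
For completeness, switching between the two execution providers is just a matter of the `provider` argument when loading the optimized model — a sketch, assuming the `bge_auto_opt_O4` directory from the export step:

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction

# CUDA execution provider: works for me, roughly a 3x latency improvement.
model = ORTModelForFeatureExtraction.from_pretrained(
    "bge_auto_opt_O4", provider="CUDAExecutionProvider"
)

# TensorRT execution provider: builds an engine on first use, which can
# take a very long time and is where my runs get stuck.
# model = ORTModelForFeatureExtraction.from_pretrained(
#     "bge_auto_opt_O4", provider="TensorrtExecutionProvider"
# )
```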

I'm okay with just fixing some typos and merging this PR, in the hope that others don't run into the same issues. Perhaps at some point I can implement proper ONNX support for SentenceTransformers and SetFit so it works a bit more consistently.

Did you also get the -[x] values not close enough, max diff: 2.19140625 (atol: 0.0001) warning? And did you encounter issues with running TensorRT? And should I just get this PR merged with some typo fixes? I'm curious about your thoughts.

  • Tom Aarsen

@MosheWasserb
Collaborator Author

MosheWasserb commented Nov 22, 2023

Thanks @tomaarsen
I noticed a high variance when calculating the average latency for DistilBERT. I'm not sure what the reason is when running with Google Colab.
If you get a small variance in your experiments, it's best to stick with your results.
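
For what it's worth, averaging over many runs after a warmup (and reporting the standard deviation) helps separate real differences from Colab noise — a minimal sketch, assuming a loaded model with a `predict` method; the helper function is hypothetical:

```python
import time

import numpy as np

def measure_latency(predict_fn, text, warmup=10, runs=100):
    # Warm up first so one-time costs (CUDA init, engine builds) don't skew the stats.
    for _ in range(warmup):
        predict_fn(text)
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(text)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return np.mean(latencies_ms), np.std(latencies_ms)

mean_ms, std_ms = measure_latency(lambda t: model.predict([t]), "an example sentence")
print(f"latency: {mean_ms:.2f} ms +/- {std_ms:.2f} ms")
```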

I also see a similar warning when exporting to ONNX; it also appears in the original notebook shared by Ilyas Moutawwakil (our colleague from Hugging Face). See here: https://twitter.com/IlysMoutawwakil/status/1705215192425288017, and the original notebook: https://colab.research.google.com/drive/10UAtpz26Gv2LtamT8j33LmI5UFQFwF4T?usp=sharing
Maybe it's best to ask Ilyas or @philschmid about the warning's status and implications.

I see good performance when running CUDA (~2.5ms) and TensorRT (~2.3ms) on my Google Colab. I didn't see any issue with exporting to TensorRT.

Moshe

@tomaarsen
Member

I'll retry with TensorRT on Google Colab rather than locally. As for the warning, the model accuracy is still identical, so it seems like it's not something to be too concerned about. Thanks for the info!

  • Tom Aarsen

@tomaarsen
Member

@MosheWasserb

I've continued trying to get TensorRT installed, now on Google Colab. It ends up failing when building the engine.
Beyond that, I realised that the pipe model from transformers was loaded on the CPU, while the SetFit model was on CUDA, so the comparison was not fair.
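
For reference, placing the transformers pipe on the GPU is just the `device` argument — a sketch; the task and checkpoint here are stand-ins for whatever the notebook uses:

```python
from transformers import pipeline

# device=0 places the pipeline on the first CUDA device; the default (-1) is CPU,
# which is what made the earlier comparison unfair.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in checkpoint
    device=0,
)
```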

This is with CUDA rather than TensorRT, and with the pipe on the GPU instead:
[screenshot: latency benchmark with both models on the GPU]

Perhaps I could remove the reference model from the notebook and focus only on the gains from using ONNX? I suppose that a DistilBERT model of 268MB will still be faster than a 127MB BERT model. I'm curious to hear your thoughts.

  • Tom Aarsen

@MosheWasserb
Collaborator Author

@tomaarsen
I agree, it's best to remove DistilBERT from the notebook. Let's focus on the gain of using ONNX for SetFit.

@tomaarsen
Member

Will do. I will try to incorporate this into a how-to guide in the SetFit documentation for v1.0.0.

@tomaarsen tomaarsen changed the base branch from main to v1.0.0-pre November 27, 2023 13:19
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@tomaarsen tomaarsen changed the title Add notebook to demonstrate how efficiently running SetFit with TensorRT Add notebook to demonstrate how efficiently running SetFit with ONNX Nov 27, 2023
@tomaarsen tomaarsen mentioned this pull request Nov 27, 2023
@tomaarsen tomaarsen merged commit 193f83f into v1.0.0-pre Nov 27, 2023
18 checks passed
@tomaarsen tomaarsen deleted the moshe branch November 27, 2023 13:58
@tomaarsen
Member

Thanks for your work on this, Moshe!

@MosheWasserb
Collaborator Author

Thank you, Tom, for this great release!
