
Question: Can Context FMHA be used to implement Transformer in a vision encoder for multimodal models? #2001

Closed
lmcl90 opened this issue Jul 23, 2024 · 4 comments
Labels: question, stale

Comments

lmcl90 commented Jul 23, 2024

I see that the multimodal models in the examples all use TensorRT directly to deploy their vision encoders. Why not use TensorRT-LLM? Are there known issues or challenges associated with integrating Context FMHA into vision encoders?

QiJune added the question label on Jul 23, 2024
QiJune (Collaborator) commented Jul 23, 2024

Yes, you can try to use TensorRT-LLM for the vision encoders. We have a BERT example and a DiT example, and the community has also contributed an SDXL model. I think it's not hard to develop a ViT model.
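
For reference, here is a minimal sketch of what a pre-norm ViT-style encoder block might look like when written against the TensorRT-LLM Python layer API, modeled on the BERT example. The layer names (`BertAttention`, `MLP`, `LayerNorm`) and the plugin-config calls shown in the trailing comments are assumptions based on that example and on releases from around this time; exact signatures may differ between versions, so check them against the installed `tensorrt_llm` package.

```python
# Minimal sketch of a pre-norm ViT-style transformer block using TensorRT-LLM
# layers, modeled on the BERT example. Layer names and signatures are
# assumptions based on tensorrt_llm releases from around this time; verify
# them against your installed version.
from tensorrt_llm import Module
from tensorrt_llm.layers import MLP, BertAttention, LayerNorm


class ViTBlock(Module):

    def __init__(self, hidden_size, num_heads, mlp_ratio=4, dtype=None):
        super().__init__()
        self.norm1 = LayerNorm(hidden_size, dtype=dtype)
        # BertAttention is bidirectional (no causal mask), which matches the
        # self-attention over image patches used by a ViT encoder.
        self.attention = BertAttention(hidden_size, num_heads, dtype=dtype)
        self.norm2 = LayerNorm(hidden_size, dtype=dtype)
        self.mlp = MLP(hidden_size, hidden_size * mlp_ratio,
                       hidden_act='gelu', dtype=dtype)

    def forward(self, hidden_states, input_lengths=None):
        # input_lengths is used when the BERT attention plugin (and hence
        # context FMHA) is enabled at engine build time.
        hidden_states = hidden_states + self.attention(
            self.norm1(hidden_states), input_lengths=input_lengths)
        hidden_states = hidden_states + self.mlp(self.norm2(hidden_states))
        return hidden_states


# Context FMHA itself is switched on through the plugin config when the
# network is built, e.g. (names as in the BERT example of this era):
#   from tensorrt_llm.plugin.plugin import ContextFMHAType
#   network.plugin_config.set_bert_attention_plugin('float16')
#   network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
```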

lmcl90 (Author) commented Jul 23, 2024

@QiJune Thanks for your reply. I will give it a try.

github-actions bot commented Aug 23, 2024

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.

github-actions bot added the stale label on Aug 23, 2024
github-actions bot commented Sep 21, 2024

This issue was closed because it has been stalled for 15 days with no activity.

github-actions bot closed this as not planned on Sep 21, 2024