
clip : offload to GPU #4061

Closed
ggerganov opened this issue Nov 13, 2023 · 12 comments
Labels
good first issue Good for newcomers performance Speed related topics

Comments

@ggerganov
Owner

With the recent support for running convolutions on the GPU (#4060) we should be able to offload CLIP to run fully on the GPU.

static ggml_cgraph * clip_image_build_graph(const clip_ctx * ctx, const clip_image_f32_batch * imgs) {
    if (!ctx->has_vision_encoder) {
        printf("This gguf file seems to have no vision encoder\n");
        return nullptr;
    }
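
For context, a rough sketch of what running this graph fully on the GPU could look like with the ggml-backend API. The calls used here (ggml_backend_cuda_init, ggml_gallocr_new, ggml_backend_graph_compute) are real ggml-backend functions, but their exact signatures moved around between ggml revisions of this period and weight placement is glossed over, so treat this as an illustrative sketch rather than the actual clip.cpp change:

// sketch only: hand a prebuilt clip graph to a GPU backend instead of the CPU
#include "ggml.h"
#include "ggml-backend.h"
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

static bool clip_compute_on_gpu(const clip_ctx * ctx, const clip_image_f32_batch * imgs) {
#ifdef GGML_USE_CUBLAS
    ggml_backend_t backend = ggml_backend_cuda_init(0);   // device 0
#else
    ggml_backend_t backend = ggml_backend_cpu_init();     // fallback when built without CUDA
#endif
    if (backend == NULL) {
        return false;
    }

    ggml_cgraph * gf = clip_image_build_graph(ctx, imgs); // the function quoted above
    if (gf == NULL) {
        ggml_backend_free(backend);
        return false;
    }

    // allocate the graph's tensors in the backend's buffer and run it there;
    // in a real change the model weights also have to live in a backend buffer
    ggml_gallocr_t alloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(alloc, gf);
    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(alloc);
    ggml_backend_free(backend);
    return true;
}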

@ggerganov ggerganov added good first issue Good for newcomers performance Speed related topics labels Nov 13, 2023
@cmp-nct
Contributor

cmp-nct commented Nov 13, 2023

It seems minor, but I believe supporting CLIP is a major step ahead; it's such a fundamental model.

@ggerganov
Owner Author

Ideally, CLIP should be supported as a separate model arch in llama.cpp, but it will take some extra work to achieve this: abetlen/llama-cpp-python#813 (comment)

We should do it at some point in the future.

@monatis
Collaborator

monatis commented Nov 13, 2023

Ideally, CLIP should be supported as a separate model arch in llama.cpp,

Maybe we can start by porting the full text and vision encoder parts from my clip.cpp to llama.cpp/examples/llava/clip.[h/cpp], and with the community's testing and feedback we can polish the implementation gradually. Then we can include it directly in llama.cpp as an additional arch once we are confident about its public API and functionality. Or I can continue to develop it externally in that repo and merge it later. @ggerganov WDYT?

@cmp-nct
Contributor

cmp-nct commented Nov 14, 2023

I'd love to see full CLIP support in llama.cpp soon.
The current clip implementation is cut down to only what llava needed; monatis's version contains a lot more functionality.
Imho we should aim to get the full feature set in. The most important use case is probably llava, but as a standalone image-analysis tool CLIP is very valuable.

@FSSRepo
Collaborator

FSSRepo commented Nov 26, 2023

@ggerganov I have implemented broadcasting for the ggml_add and ggml_mul operations (only for the CPU and CUDA backends). I am just waiting for my pull request to be merged into stable diffusion and will then have some time to incorporate the changes I made in ggml.
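
A minimal sketch of what broadcasting in ggml_add enables, assuming a ggml revision that has ggml_new_graph and ggml_graph_compute_with_ctx (the graph-compute entry points changed around this time); the tensor shapes are made-up placeholders. The point is that the bias no longer needs an explicit ggml_repeat before the add:

#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,  // arena for tensors + graph
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // activations: [hidden, tokens], bias: [hidden]
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1024, 577);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);

    // with broadcasting, b is implicitly repeated along the token dimension,
    // instead of materializing ggml_repeat(ctx, b, x) first
    struct ggml_tensor * y = ggml_add(ctx, x, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 4);

    ggml_free(ctx);
    return 0;
}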

@ggerganov
Owner Author

Great! Would be great to PR them in llama.cpp (i.e. here) so that we can test CLIP performance.
I think I will be able to help with the Metal implementation

@FSSRepo
Collaborator

FSSRepo commented Nov 26, 2023

Great! Would be great to PR them in llama.cpp (i.e. here) so that we can test CLIP performance. I think I will be able to help with the Metal implementation

See #4205. I think that, for now, we shouldn't merge that pull request until the changes I made to ggml are applied in the main project. This way, we'll also have a more comprehensive implementation, eliminating the explicit repeat operations and all that.

@y10ab1
Contributor

y10ab1 commented Dec 8, 2023

Do we have any updates on this feature? I am eager to use it!

@cmp-nct
Contributor

cmp-nct commented Dec 8, 2023

@ggerganov @FSSRepo
It would be awesome to get this pushed into ggml and llama.cpp.
Did you see my discussion on CogVLM? #4350
It's a vision model that beats GPT4-Vision and should run well on 8-9 GB VRAM when quantized; it's the first time I have seen anything beating OpenAI. We will definitely need full CLIP offload, but the main obstacle is that it has an additional architecture (more than just llava's two-layer projection) that connects Vicuna-7 with Big-ViT.
I know it's a bit off topic here, just pushing this because I think it's so significant and totally overlooked.
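
For reference on the "two-layer projection" point: llava-1.5's multimodal projector is essentially a two-layer MLP that maps ViT patch embeddings into the LLM embedding space, roughly the following in ggml. Tensor names and dimensions here are illustrative placeholders, not the actual clip.cpp identifiers; CogVLM adds a substantially larger connector on top of something like this:

// sketch of an llava-1.5 style projector: linear -> GELU -> linear
// (names w0/b0/w1/b1 and the dimensions are hypothetical)
static struct ggml_tensor * mm_projector(
        struct ggml_context * ctx,
        struct ggml_tensor  * embeddings,                   // [d_vision, n_patches]
        struct ggml_tensor  * w0, struct ggml_tensor * b0,  // [d_vision, d_llm], [d_llm]
        struct ggml_tensor  * w1, struct ggml_tensor * b1)  // [d_llm, d_llm],    [d_llm]
{
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w0, embeddings); // [d_llm, n_patches]
    cur = ggml_add(ctx, cur, b0);      // broadcast bias add
    cur = ggml_gelu(ctx, cur);         // non-linearity between the two layers
    cur = ggml_mul_mat(ctx, w1, cur);  // [d_llm, n_patches]
    cur = ggml_add(ctx, cur, b1);
    return cur;                        // image tokens in the LLM's embedding space
}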

@FSSRepo
Collaborator

FSSRepo commented Dec 11, 2023

@cmp-nct It seems that the architecture of that vision model differs from the CLIP and Llama implementations here. The truth is that there will be a lot of work to do if we want to have it here.

@cmp-nct
Contributor

cmp-nct commented Dec 11, 2023

You are certainly right about the work required; it's likely about as much as the entire clip.cpp effort has been.

At this point it's the best thing we have in open source for vision; it's right at eye level with GPT4-Vision.
That's the first time I have seen anything open (or closed) really compete with the best OpenAI has to offer.

For "simple vision", llava-1.5 (ShareGPT4V atm) is working great with clip.cpp.
If we want really good vision with good OCR, then CogVLM would be the current choice.

The only high-level alternative is QwenVL, which is significantly worse than CogVLM and about the same amount of work to integrate here.

@ggerganov
Owner Author

Done via #4205 and #4696
