model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules #16367
base: master
Conversation
model: add support for EmbeddingGemma SentenceTransformers dense linear projections

Adding support for the Dense modules used in EmbeddingGemma models. EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone. See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
My understanding of how SentenceTransformer works is that these modules are applied after the base model has produced its output. SentenceTransformer scans for numbered module directories (in this case 1_Pooling, 2_Dense, and 3_Dense) and applies them sequentially as post-processing steps.

This came up during development, and it was decided not to include any of these modules in the llama.cpp conversion. The model should output the base transformer embeddings directly. Pooling can be optionally configured in llama.cpp and can be done in several different ways (mean, CLS, last token, etc.) or not at all, depending on the user's needs. Including the Dense layers would bake in a specific post-processing pipeline that assumes mean pooling will always be used, which reduces the flexibility that the pooling options provide. Additionally, users may want access to the raw token embeddings from the base model for their own use cases, rather than having the SentenceTransformer post-processing baked in. Keeping these separate allows users to choose whether they want the SentenceTransformer behavior or the raw model outputs.

That's at least my take on this matter, but if others disagree I'm open to these changes. I just wanted to provide some background on the reasoning.
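For readers unfamiliar with this mechanism, here is a minimal sketch of how such a module stack is applied, assuming the standard sentence-transformers layout; the model id is a placeholder and not taken from this PR.

```python
# Hedged sketch: how SentenceTransformers chains its numbered module directories
# (0_Transformer, 1_Pooling, 2_Dense, 3_Dense, ...) as sequential post-processing.
# The model id is an assumption; substitute the actual EmbeddingGemma repository.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# SentenceTransformer behaves like an nn.Sequential: each entry corresponds to one
# numbered module directory, and the output of one module feeds the next.
for idx, module in enumerate(model):
    print(idx, type(module).__name__)   # e.g. Transformer, Pooling, Dense, Dense, ...

# encode() runs the full pipeline, so the returned vectors already include the Dense
# projections -- unlike the raw transformer embeddings a base-only GGUF would expose.
print(model.encode(["What is the capital of France?"]).shape)
```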
@danbev I had already anticipated the reasons why the dense layers were not included in the first place. Let's take the example of my own project, where I'm using … Please see the example below.

Base Model:

ST Model:

The results become even more different (not to say worse) when MRL-reduced dimensions are used. As a user, I would have preferred the following:
I could also imagine accommodating or implementing ST modules in a more generic way, similar to how LoRA adapters are handled. Sorry for making this so long, but this model is an important one for users like me.
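A hedged reconstruction of this kind of comparison, for illustration only: mean-pooled embeddings taken straight from the base transformer versus the output of the full SentenceTransformers pipeline, which also applies the Dense modules. The model id, query text, and pooling details are assumptions, not the commenter's actual setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

model_id = "google/embeddinggemma-300m"   # assumed id; adjust as needed
texts = ["What is the capital of France?"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
base = AutoModel.from_pretrained(model_id)
st = SentenceTransformer(model_id)

with torch.no_grad():
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = base(**enc).last_hidden_state                    # raw token embeddings
    mask = enc["attention_mask"].unsqueeze(-1).float()
    base_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling only
    base_emb = torch.nn.functional.normalize(base_emb, dim=-1)

st_emb = torch.tensor(st.encode(texts, normalize_embeddings=True))

# With the Dense projections applied, the ST vector generally differs from the plain
# mean-pooled transformer output (assuming the two output dimensions agree).
print(torch.nn.functional.cosine_similarity(base_emb, st_emb))
```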
@sfallah Thanks for the detailed description - this is quite helpful. The main reason support for Dense embedding modules has not been implemented is that until recently (i.e. until our work with @danbev on EmbeddingGemma) I had no idea what their purpose is and how they are used. But now it is more clear. We should add some way to support that. It seems it would involve generalizing/extending the pooling logic/API, as well as (optionally) incorporating the modules (i.e. the tensors) into the GGUFs during conversion.
On first thought, the configuration of the dense modules would have to be done on the …

Since you have some first steps towards adding support for dense modules with this PR, we can continue with designing and implementing support for dense module configurations. Let me know if you are interested in putting some extra work into this, and I will try to provide steps for how to proceed.
@ggerganov
@sfallah Thanks for the detailed explanation! This does seem very important for RAG use cases. I've added #16387 for updating the model-conversion example (tool), which we've used for a few models now. I've tried this out with your pull request and it seems to work. Hopefully we can update this as the work progresses and be prepared for future models that require the same type of features.
I wonder if there is a very simple solution that we can do: optionally include the dense module tensors in the GGUF during conversion.

This way a user can create a GGUF that either includes the dense modules or not, depending on what they need. This makes the implementation much simpler, as we don't have to extend the API. But it creates some burden for the user - they would have to be careful about which GGUF they are using. In any case, doing it like this is basically a first step towards the more general support later that would allow turning the dense modules on and off during context creation.
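Since the two conversion outputs would differ in which tensors they carry, a user could tell the variants apart by inspecting the GGUF. A rough sketch using the gguf Python package; the "dense" name substring and the file names are assumptions, as the actual tensor naming comes from the conversion code.

```python
# Hedged sketch: distinguish a GGUF that carries the dense-module tensors from one
# that does not. The "dense" substring is an assumed naming convention.
from gguf import GGUFReader

def has_dense_modules(path: str) -> bool:
    reader = GGUFReader(path)
    return any("dense" in t.name for t in reader.tensors)

print(has_dense_modules("embeddinggemma-with-dense.gguf"))   # placeholder filenames
print(has_dense_modules("embeddinggemma-base-only.gguf"))
```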
@ggerganov In my opinion, it is not a burden for the user (at least not for me) to know whether a GGUF model includes the dense layers or not, in the same way that I need to know which quantization type my GGUF model has. The only issue is flexibility regarding pooling, which in practice I would not see as a problem, for the following reasons:
I know the second point is a bit oversimplified, but in practice it is generally true for models like this. So I don't see any problem if, for example, in the case of the EmbeddingGemma GGUF with dense layers, the pooling is effectively fixed by the model.
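Relatedly, a user who sticks with a base-only GGUF could still apply the Dense modules themselves as a separate post-processing step. A rough sketch; the directory layout and the "linear.weight"/"linear.bias" state-dict keys are assumptions based on the usual sentence-transformers Dense export, not something defined by this PR.

```python
# Hedged sketch: apply the EmbeddingGemma Dense projections manually on top of a
# pooled embedding produced by llama.cpp (or any other backend).
import torch

def load_dense(module_dir: str):
    # each *_Dense directory is assumed to hold a single linear layer
    state = torch.load(f"{module_dir}/pytorch_model.bin", map_location="cpu")
    return state["linear.weight"], state.get("linear.bias")

def apply_dense(x: torch.Tensor, weight: torch.Tensor, bias) -> torch.Tensor:
    y = x @ weight.T
    return y + bias if bias is not None else y

pooled = torch.randn(768)                  # placeholder for a mean-pooled embedding
for module_dir in ("2_Dense", "3_Dense"):  # assumed module directories
    w, b = load_dense(module_dir)
    pooled = apply_dense(pooled, w, b)

# ST pipelines typically finish with L2 normalization
final = torch.nn.functional.normalize(pooled, dim=-1)
print(final.shape)
```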
model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting model with dense-layers is optional
- introduced dense config params
Overview of Changes
About Module Configuration

By reading the dense-module configuration, we lay the groundwork for full linear projection support.
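The per-module configuration referred to here presumably lives in each module's config.json. A hedged sketch of reading it, with key names taken from the typical sentence-transformers Dense config rather than from this PR's conversion code:

```python
# Hedged sketch: collect the in/out feature sizes of each Dense module so that a
# conversion step could record them as metadata. Key names are assumptions.
import json
from pathlib import Path

def read_dense_config(model_dir: str) -> dict:
    configs = {}
    for cfg_path in sorted(Path(model_dir).glob("*_Dense/config.json")):
        with open(cfg_path) as f:
            cfg = json.load(f)
        configs[cfg_path.parent.name] = (cfg["in_features"], cfg["out_features"])
    return configs

# returns a mapping like {"2_Dense": (in_features, out_features), "3_Dense": (...)}
print(read_dense_config("path/to/embeddinggemma"))
```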
Very cool! My main question here is: why not just add another pooling type? In fact, you could even use the existing tensor placeholders.
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
@iamlemec

The PR introduces struct members and flags that are intended to be generic for all Sentence-Transformer (ST) models, though they're currently only applied to EmbeddingGemma.

You're right about that.

I understand the suggestion, but I'd prefer to maintain a conceptual distinction between Sentence-Transformer models and other architectures. As you know, ST models, per the official structure, typically include pooling and dense modules as part of the model definition. While some models (e.g., …) …

BTW, thank you again @iamlemec for your excellent work on embedding and reranker models in llama.cpp.
Overall looking good.

From what I understand, we currently assume that the output features from the dense modules would be equal to hparams.n_embd. I guess this is usually true, but maybe in the future we would have to add support for cases where it's not true. This would require adding some way for the user to query the number of output features through the libllama API.

Regarding @iamlemec's proposal about adding a new DENSE pooling type - I think a better alternative would be to separate the RANK and DENSE concepts away from the pooling type. For example, reranking assumes a certain pooling type - currently LAST from what we've seen. But there is no guarantee that in the future some model would not use a different pooling type to do reranking. The same argument is valid for the dense modules. So the two concepts would have to be separated. For now, this PR is OK - we can work on this after.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
I've updated #16387 to support the dense modules.
- asserts checking dense features dims
This will require us to publish separate models for these "ST models", if I've understood this correctly? If we take EmbeddingGemma as an example, this will add another 3 models to the existing ones. This might not be a big deal, but it is something worth thinking about and something that we should document so that it is not missed when publishing new models.