model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules #16367
base: master
Conversation
model: add support for EmbeddingGemma SentenceTransformers dense linear projections

Adding support for the Dense modules used in EmbeddingGemma models. EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone. See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
My understanding of how SentenceTransformer works is that these modules are applied after the base model has produced its output. SentenceTransformer scans for numbered module directories (in this case 1_Pooling, 2_Dense, and 3_Dense) and applies them sequentially as post-processing steps.

This came up during development, and it was decided not to include any of these modules in the llama.cpp conversion. The model should output the base transformer embeddings directly. Pooling can be optionally configured in llama.cpp and can be done in several different ways (mean, CLS, last token, etc.) or not at all, depending on the user's needs. Including the Dense layers would bake in a specific post-processing pipeline that assumes mean pooling will always be used, which reduces the flexibility that the pooling options provide. Additionally, users may want access to the raw token embeddings from the base model for their own use cases, rather than having the SentenceTransformer post-processing baked in. Keeping these separate allows users to choose whether they want the SentenceTransformer behavior or the raw model outputs.

That's at least my take on this matter, but if others disagree I'm open to these changes. I just wanted to provide some background on the reasoning.
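For readers unfamiliar with this mechanism, here is a minimal sketch of how such a module stack is applied, assuming the standard sentence-transformers layout; the model id is a placeholder and not taken from this PR.

```python
# Hedged sketch: how SentenceTransformers chains its numbered module directories
# (0_Transformer, 1_Pooling, 2_Dense, 3_Dense, ...) as sequential post-processing.
# The model id is an assumption; substitute the actual EmbeddingGemma repository.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# SentenceTransformer behaves like an nn.Sequential: each entry corresponds to one
# numbered module directory, and the output of one module feeds the next.
for idx, module in enumerate(model):
    print(idx, type(module).__name__)   # e.g. Transformer, Pooling, Dense, Dense, ...

# encode() runs the full pipeline, so the returned vectors already include the Dense
# projections -- unlike the raw transformer embeddings a base-only GGUF would expose.
print(model.encode(["What is the capital of France?"]).shape)
```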
@danbev I had already anticipated the reasons why the dense layers were not included in the first place. Let's take the example of my own project, where I'm using … Please see the example below.

Base Model:

ST Model:

The results become even more different (not to say worse) when MRL-reduced dimensions are used. As a user, I would have preferred the following:
I could also imagine accommodating or implementing ST modules in a more generic way, similar to how LoRA adapters are handled. Sorry for making this so long, but this model is an important one for users like me.
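A hedged reconstruction of this kind of comparison, for illustration only: mean-pooled embeddings taken straight from the base transformer versus the output of the full SentenceTransformers pipeline, which also applies the Dense modules. The model id, query text, and pooling details are assumptions, not the commenter's actual setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

model_id = "google/embeddinggemma-300m"   # assumed id; adjust as needed
texts = ["What is the capital of France?"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
base = AutoModel.from_pretrained(model_id)
st = SentenceTransformer(model_id)

with torch.no_grad():
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = base(**enc).last_hidden_state                    # raw token embeddings
    mask = enc["attention_mask"].unsqueeze(-1).float()
    base_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling only
    base_emb = torch.nn.functional.normalize(base_emb, dim=-1)

st_emb = torch.tensor(st.encode(texts, normalize_embeddings=True))

# With the Dense projections applied, the ST vector generally differs from the plain
# mean-pooled transformer output (assuming the two output dimensions agree).
print(torch.nn.functional.cosine_similarity(base_emb, st_emb))
```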
@sfallah Thanks for the detailed description - this is quite helpful. The main reason support for Dense embedding modules has not been implemented is that until recently (i.e. until our work with @danbev on EmbeddingGemma) I had no idea what their purpose is and how they are used. But now it is more clear. We should add some way to support that. It seems it would involve generalizing/extending the pooling logic/API, as well as (optionally) incorporating the modules (i.e. the tensors) into the GGUFs during conversion.
On first thought, the configuration of the dense modules would have to be done on the …

Since you have some first steps towards adding support for dense modules with this PR, we can continue with designing and implementing support for dense module configurations. Let me know if you are interested in putting some extra work into this, and I will try to provide steps for how to proceed.
@ggerganov
@sfallah Thanks for the detailed explanation! This does seem very important for RAG use cases. I've added #16387 for updating the model-conversion example (tool), which we've used for a few models now. I've tried this out with your pull request and it seems to work. Hopefully we can update this as the work progresses and be prepared for future models that require the same type of features.
I wonder if there is a very simple solution that we can do: optionally include the dense module tensors in the GGUF during conversion.

This way a user can create a GGUF that either includes the dense modules or not, depending on what they need. This makes the implementation much simpler, as we don't have to extend the API. But it creates some burden for the user - they would have to be careful about which GGUF they are using. In any case, doing it like this is basically a first step towards the more general support later that would allow turning the dense modules on and off during context creation.
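Since the two conversion outputs would differ in which tensors they carry, a user could tell the variants apart by inspecting the GGUF. A rough sketch using the gguf Python package; the "dense" name substring and the file names are assumptions, as the actual tensor naming comes from the conversion code.

```python
# Hedged sketch: distinguish a GGUF that carries the dense-module tensors from one
# that does not. The "dense" substring is an assumed naming convention.
from gguf import GGUFReader

def has_dense_modules(path: str) -> bool:
    reader = GGUFReader(path)
    return any("dense" in t.name for t in reader.tensors)

print(has_dense_modules("embeddinggemma-with-dense.gguf"))   # placeholder filenames
print(has_dense_modules("embeddinggemma-base-only.gguf"))
```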
@ggerganov In my opinion, it is not a burden for the user (at least not for me) to know whether a GGUF model includes the dense layers or not, in the same way that I need to know which quantization type my GGUF model has. The only issue is flexibility regarding pooling, which in practice I would not see as a problem, for the following reasons:
I know the second point is a bit oversimplified, but in practice it is generally true for models like this. So I don't see any problem if, for example, in the case of the EmbeddingGemma GGUF with dense layers, the pooling is effectively fixed by the model.
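Relatedly, a user who sticks with a base-only GGUF could still apply the Dense modules themselves as a separate post-processing step. A rough sketch; the directory layout and the "linear.weight"/"linear.bias" state-dict keys are assumptions based on the usual sentence-transformers Dense export, not something defined by this PR.

```python
# Hedged sketch: apply the EmbeddingGemma Dense projections manually on top of a
# pooled embedding produced by llama.cpp (or any other backend).
import torch

def load_dense(module_dir: str):
    # each *_Dense directory is assumed to hold a single linear layer
    state = torch.load(f"{module_dir}/pytorch_model.bin", map_location="cpu")
    return state["linear.weight"], state.get("linear.bias")

def apply_dense(x: torch.Tensor, weight: torch.Tensor, bias) -> torch.Tensor:
    y = x @ weight.T
    return y + bias if bias is not None else y

pooled = torch.randn(768)                  # placeholder for a mean-pooled embedding
for module_dir in ("2_Dense", "3_Dense"):  # assumed module directories
    w, b = load_dense(module_dir)
    pooled = apply_dense(pooled, w, b)

# ST pipelines typically finish with L2 normalization
final = torch.nn.functional.normalize(pooled, dim=-1)
print(final.shape)
```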
model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting model with dense-layers is optional
- introduced dense config params
Overview of Changes
About Module Configuration

By reading the dense-module configuration, we lay the groundwork for full linear projection support.
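The per-module configuration referred to here presumably lives in each module's config.json. A hedged sketch of reading it, with key names taken from the typical sentence-transformers Dense config rather than from this PR's conversion code:

```python
# Hedged sketch: collect the in/out feature sizes of each Dense module so that a
# conversion step could record them as metadata. Key names are assumptions.
import json
from pathlib import Path

def read_dense_config(model_dir: str) -> dict:
    configs = {}
    for cfg_path in sorted(Path(model_dir).glob("*_Dense/config.json")):
        with open(cfg_path) as f:
            cfg = json.load(f)
        configs[cfg_path.parent.name] = (cfg["in_features"], cfg["out_features"])
    return configs

# returns a mapping like {"2_Dense": (in_features, out_features), "3_Dense": (...)}
print(read_dense_config("path/to/embeddinggemma"))
```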
Very cool! My main question here is: why not just add another pooling type? In fact, you could even use the existing tensor placeholders.
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
@iamlemec

The PR introduces struct members and flags that are intended to be generic for all Sentence-Transformer (ST) models, though they're currently only applied to EmbeddingGemma.

You're right about that.

I understand the suggestion, but I'd prefer to maintain a conceptual distinction between Sentence-Transformer models and other architectures. As you know, ST models, per the official structure, typically include pooling and dense modules as part of the model definition. While some models (e.g., …) …

BTW, thank you again @iamlemec for your excellent work on embedding and reranker models in llama.cpp.
Overall looking good.

From what I understand, we currently assume that the output features from the dense modules would be equal to hparams.n_embd. I guess this is usually true, but maybe in the future we would have to add support for cases where it's not true. This would require adding some way for the user to query the number of output features through the libllama API.

Regarding @iamlemec's proposal about adding a new DENSE pooling type - I think a better alternative would be to separate the RANK and DENSE concepts away from the pooling type. For example, reranking assumes a certain pooling type - currently LAST from what we've seen. But there is no guarantee that in the future some model would not use a different pooling type to do reranking. The same argument is valid for the dense modules. So the two concepts would have to be separated. For now, this PR is OK - we can work on this after.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
I've updated #16387 to support the dense modules.
- asserts checking dense features dims
This will require us to publish separate models for these "ST models", if I've understood this correctly? If we take EmbeddingGemma as an example, this will add another 3 models to the existing ones. This might not be a big deal, but it is something worth thinking about and something that we should document so that it is not missed when publishing new models.