Merged
7 changes: 3 additions & 4 deletions src/llama-graph.cpp
@@ -1236,8 +1236,7 @@ llm_graph_input_attn_kv_unified * llm_graph_context::build_attn_inp_kv_unified()
    auto inp = std::make_unique<llm_graph_input_attn_kv_unified>(hparams, cparams, kv_self);

    {
-        GGML_ASSERT(hparams.n_swa_pattern == 1 && "Use llama_kv_cache_unified_iswa for SWA");
-        GGML_ASSERT(hparams.n_swa == 0 && "Use llama_kv_cache_unified_iswa for SWA");
+        GGML_ASSERT(hparams.swa_type == LLAMA_SWA_TYPE_NONE && "Use llama_kv_cache_unified_iswa for SWA");

        const auto n_kv = kv_self->get_n();

@@ -1312,8 +1311,8 @@ llm_graph_input_attn_kv_unified_iswa * llm_graph_context::build_attn_inp_kv_unif
        inp->self_kq_mask_cnv = cparams.flash_attn ? ggml_cast(ctx0, inp->self_kq_mask, GGML_TYPE_F16) : inp->self_kq_mask;
    }

-    if (hparams.n_swa_pattern > 1) {
-        GGML_ASSERT(hparams.n_swa > 0 && "Use llama_kv_cache_unified for non-SWA");
+    {
+        GGML_ASSERT(hparams.swa_type != LLAMA_SWA_TYPE_NONE && "Use llama_kv_cache_unified for non-SWA");

        const auto n_kv = kv_self->get_kv_swa()->get_n();

65 changes: 27 additions & 38 deletions src/llama-model.cpp
@@ -853,43 +853,16 @@ void llama_model::load_hparams(llama_model_loader & ml) {
                    default: type = LLM_TYPE_UNKNOWN;
                }

-                // for backward compatibility ; see: https://github.com/ggerganov/llama.cpp/pull/8931
-                if ((hparams.n_layer == 32 || hparams.n_layer == 40) && hparams.n_ctx_train == 4096) {
-                    // default value for Phi-3-mini-4k-instruct and Phi-3-medium-4k-instruct
-                    LLAMA_LOG_WARN("%s: assuming n_swa = 2047 for Phi-3-mini-4k-instruct and Phi-3-medium-4k-instruct\n", __func__);
Comment on lines -856 to -859
@ngxson (Collaborator), May 21, 2025:

Sorry, I didn't see your review request.

Hmm, I think these configs still need to be kept. The original issue talked about phi-4 and not phi-3, and these branches of code were there to make sure old phi-3 models work correctly.

Probably phi-4 conflicts with this, but I'll have a look. Anyway, I think n_swa should still be set (it can't be forced to 0), so that the model can mask out tokens correctly for long sequences.

Member Author:

The problem is that the n_swa value never does anything when n_swa_pattern == 1, and it has always been like this. So unless I am missing something, these configs didn't do anything.

Collaborator:

Hmm, ok, it seems like n_swa_pattern = 1 means all layers use SWA. I had a look at modeling_phi3.py and it seems like use_sliding_windows is set to the same value globally.

Anyway, I think we can just remove SWA altogether for phi-3/phi-4 if you feel like we're spending too much effort fixing this.

Member Author:

With our current logic n_swa_pattern = 1 actually means that none of the layers use SWA:

bool llama_hparams::is_swa(uint32_t il) const {
    if (il < n_layer) {
        return n_swa > 0 && n_swa_pattern > 0 && il % n_swa_pattern < (n_swa_pattern - 1);
    }

    GGML_ABORT("fatal error");
}
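
For illustration only (an editorial sketch, not code from the PR; the helper name is_swa_sketch and the 6-layer / 2047-window values are made up), checking this predicate for a couple of pattern values shows that n_swa_pattern == 1 marks no layer as SWA, while n_swa_pattern == 3 marks two out of every three layers:

#include <cstdint>
#include <cstdio>

// same condition as in llama_hparams::is_swa above, with the hparams passed in explicitly
bool is_swa_sketch(uint32_t il, uint32_t n_swa, uint32_t n_swa_pattern) {
    return n_swa > 0 && n_swa_pattern > 0 && il % n_swa_pattern < (n_swa_pattern - 1);
}

int main() {
    const uint32_t patterns[] = {1, 3};
    for (uint32_t pattern : patterns) {
        printf("n_swa_pattern = %u:", pattern);
        for (uint32_t il = 0; il < 6; ++il) {
            printf(" %d", is_swa_sketch(il, 2047, pattern) ? 1 : 0);
        }
        printf("\n"); // pattern 1 -> 0 0 0 0 0 0, pattern 3 -> 1 1 0 1 1 0
    }
    return 0;
}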

Some models have a parameter dense_attention_every_n_layers:

https://huggingface.co/microsoft/Phi-3-small-128k-instruct/blob/main/config.json#L18

This probably corresponds to our n_swa_pattern, but the conversion scripts and the model loading logic completely ignore it at the moment.

Collaborator:

Ok, I hadn't noticed that dense_attention_every_n_layers. I think for now let's just disable the whole thing (as you are currently doing) and come back to this later. I don't think many users are still using these models, especially for long context, so it's probably not worth our time trying to fix this.

Member Author:

Yes, I think the only correct way to handle all Phi models is to update the convert scripts to provide information about the dense/swa layer pattern, which we can then use to create the respective cache. We also have to rework n_swa_pattern in some way to support "all-SWA layers" - maybe n_swa_pattern == 0 could correspond to this:

  • n_swa_pattern == 0: all layers are SWA
  • n_swa_pattern == 1: all layers are non-SWA
  • n_swa_pattern == n: every nth layer is non-SWA

Merging this for now, as I agree it's more effort to resolve than it's worth right now, and this was incorrect even before the recent SWA changes.
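
To make that proposed convention concrete, here is a minimal sketch (an editorial illustration under the assumptions above, not code from the PR or the repo; the helper name is_swa_proposed is made up):

#include <cstdint>

// Hypothetical helper implementing the proposed n_swa_pattern convention:
//   n_swa_pattern == 0 -> all layers are SWA
//   n_swa_pattern == 1 -> all layers are non-SWA
//   n_swa_pattern == n -> every nth layer is non-SWA, the rest are SWA
bool is_swa_proposed(uint32_t il, uint32_t n_swa, uint32_t n_swa_pattern) {
    if (n_swa == 0) {
        return false; // SWA disabled entirely
    }
    if (n_swa_pattern == 0) {
        return true; // all layers use the sliding window
    }
    // with pattern n, layers n-1, 2n-1, ... are dense; everything else is SWA
    return il % n_swa_pattern < (n_swa_pattern - 1);
}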

-
-                    hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
-
-                    hparams.n_swa = 2047;
-                } else if (hparams.n_layer == 32 && hparams.n_head_kv(0) == 32 && hparams.n_ctx_train == 131072) {
-                    // default value for Phi-3-mini-128k-instruct
-                    LLAMA_LOG_WARN("%s: assuming no SWA for Phi-3-mini-128k-instruct\n", __func__);
-
-                    hparams.swa_type = LLAMA_SWA_TYPE_NONE;
-
-                    hparams.n_swa = hparams.n_ctx_train;
-                    hparams.n_swa_pattern = 1;
-                } else if (hparams.n_layer == 40 && hparams.n_ctx_train == 131072) {
-                    // default value for Phi-3-medium-128k-instruct
-                    LLAMA_LOG_WARN("%s: assuming no SWA for Phi-3-medium-128k-instruct\n", __func__);
-
-                    hparams.swa_type = LLAMA_SWA_TYPE_NONE;
-
-                    hparams.n_swa = hparams.n_ctx_train;
-                    hparams.n_swa_pattern = 1;
-                }
-
-                bool found_swa = ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
-                if (!found_swa && hparams.n_swa == 0) {
-                    throw std::runtime_error("invalid value for sliding_window");
-                }
-
-                if (hparams.n_swa > hparams.n_ctx_train) {
-                    LLAMA_LOG_WARN("%s: unexpected n_swa: %d >= %d, disabling SWA\n", __func__, hparams.n_swa, hparams.n_ctx_train);
-
-                    hparams.swa_type = LLAMA_SWA_TYPE_NONE;
-
-                    hparams.n_swa = hparams.n_ctx_train;
-                    hparams.n_swa_pattern = 1;
-                }
+                const bool found_swa = ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
+
+                if (found_swa && hparams.n_swa > 0) {
+                    LLAMA_LOG_WARN("%s: Phi SWA is currently disabled - results might be suboptimal for some models (see %s)\n",
+                            __func__, "https://github.com/ggml-org/llama.cpp/pull/13676");
+
+                    // TODO: fix conversion scripts to correctly populate `n_swa` and `n_swa_pattern`
+                    hparams.swa_type = LLAMA_SWA_TYPE_NONE;
+
+                    hparams.n_swa = 0;
+                    hparams.n_swa_pattern = 1;
+                }
            } break;
@@ -7368,8 +7341,9 @@ struct llm_build_phi2 : public llm_graph_context {
    }
};

-struct llm_build_phi3_iswa : public llm_graph_context {
-    llm_build_phi3_iswa(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context(params) {
+template<bool iswa>
+struct llm_build_phi3 : public llm_graph_context {
+    llm_build_phi3(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context(params) {
        const int64_t n_embd_head = hparams.n_embd_head_v;
        const int64_t n_embd_gqa = hparams.n_embd_v_gqa();

@@ -7383,7 +7357,14 @@ struct llm_build_phi3_iswa : public llm_graph_context {
        // inp_pos - contains the positions
        ggml_tensor * inp_pos = build_inp_pos();

-        auto * inp_attn = build_attn_inp_kv_unified_iswa();
+        using inp_attn_type = std::conditional_t<iswa, llm_graph_input_attn_kv_unified_iswa, llm_graph_input_attn_kv_unified>;
+        inp_attn_type * inp_attn = nullptr;
+
+        if constexpr (iswa) {
+            inp_attn = build_attn_inp_kv_unified_iswa();
+        } else {
+            inp_attn = build_attn_inp_kv_unified();
+        }

        for (int il = 0; il < n_layer; ++il) {
            auto * residual = inpL;
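
Editorial aside (not part of the diff): a self-contained illustration of the std::conditional_t + if constexpr pattern used above; the graph builder picks the attention-input type at compile time and only the matching build call is instantiated. The type and function names below are made up.

#include <cstdio>
#include <type_traits>

// stand-ins for the two attention-input types
struct inp_full { const char * name = "unified KV cache input"; };
struct inp_iswa { const char * name = "unified iSWA KV cache input"; };

static inp_full g_full;
static inp_iswa g_iswa;

inp_full * build_inp_full() { return &g_full; }
inp_iswa * build_inp_iswa() { return &g_iswa; }

template<bool iswa>
void build_graph_sketch() {
    // pick the pointer type at compile time, like inp_attn_type in the diff
    using inp_t = std::conditional_t<iswa, inp_iswa, inp_full>;
    inp_t * inp = nullptr;

    // the branch not taken is discarded at compile time, so the mismatched
    // builder call is never instantiated against inp_t *
    if constexpr (iswa) {
        inp = build_inp_iswa();
    } else {
        inp = build_inp_full();
    }

    printf("%s\n", inp->name);
}

int main() {
    build_graph_sketch<true>();   // prints the iSWA variant
    build_graph_sketch<false>();  // prints the non-SWA variant
    return 0;
}
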
@@ -13232,7 +13213,9 @@ llama_memory_i * llama_model::create_memory(const llama_memory_params & params,

            LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);

-            if (hparams.n_swa > 0) {
+            if (hparams.swa_type != LLAMA_SWA_TYPE_NONE) {
+                GGML_ASSERT(hparams.n_swa_pattern != 1);
+
                res = new llama_kv_cache_unified_iswa(
                        *this,
                        params.type_k,
@@ -13245,6 +13228,8 @@ llama_memory_i * llama_model::create_memory(const llama_memory_params & params,
                        cparams.n_batch,
                        padding);
            } else {
+                GGML_ASSERT(hparams.n_swa_pattern == 1);
+
                res = new llama_kv_cache_unified(
                        *this,
                        nullptr,
@@ -13353,7 +13338,11 @@ llm_graph_result_ptr llama_model::build_graph(
        case LLM_ARCH_PHI3:
        case LLM_ARCH_PHIMOE:
            {
-                llm = std::make_unique<llm_build_phi3_iswa>(*this, params, gf);
+                if (hparams.swa_type != LLAMA_SWA_TYPE_NONE) {
+                    llm = std::make_unique<llm_build_phi3<true>> (*this, params, gf);
+                } else {
+                    llm = std::make_unique<llm_build_phi3<false>>(*this, params, gf);
+                }
            } break;
        case LLM_ARCH_PLAMO:
            {