[CPU][ARM64] Implemented JIT Emitter for Eltwise SoftPlus Operation #29242

srinjoydutta03 · 2025-03-03T11:33:02Z

Details:

Implemented the jit_softplus_emitter derived class for the element-wise SoftPlus operation
Added Algorithm::EltwiseSoftRelu to the supported algorithms in executors/aarch64
Added entries to get_supported_precisions and create_eltwise_emitters in kernel/aarch64
Added utils::ActivationTypes::SoftPlus to the JIT kernel check in the tests in activation.cpp

Tests:

Passed all local tests using ./bin/arm64/Release/ov_cpu_func_tests --gtest_filter="*smoke*Activation*SoftPlus*"

Tickets:

Closes [Good First Issue] [ARM]: Implement CPU plugin just-in-time emitter for SoftPlus operation #24109

CC: @a-sidorova

a-sidorova

Good job 👍🏼

Left some comments

a-sidorova · 2025-03-04T05:05:33Z

src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_eltwise_emitters.cpp

+    const TReg vmm_neg_mask(aux_vec_idxs[7]); // mask to indicate whether n is negative
+
+    h->ld1r(vmm_aux0.s, table_val2("exp_ln_flt_min_f")); // load min allowed value
+    h->fmaxnm(vmm_dst.s, vmm_src.s, vmm_aux0.s);


There might be a situation where in_vec_idxs[0] == out_vec_idxs[0], i.e. src and dst are the same register.
In that case vmm_src no longer holds the original values after this line, so we cannot rely on this register later (for example, on L2782 the register may contain modified data instead of the original input).

Could you please handle this case?

Oh, I see. Would it be enough to first move the source into an auxiliary register and then apply the fmaxnm operation in this case?
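To illustrate the aliasing hazard discussed above, here is a minimal scalar sketch. It is a toy model, not the emitter code: the register file, the function names, and the "flag values below the minimum" step are all hypothetical and only stand in for "some later instruction that still needs the original src".

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Toy register file: each "register" holds a single float lane (assumption for brevity).
using RegFile = std::array<float, 8>;

// Unsafe: clamps straight into dst, then reads src again.
// If src == dst, the second read sees the clamped value, not the original one.
inline bool clamp_and_flag_unsafe(RegFile& regs, std::size_t src, std::size_t dst, float min_val) {
    regs[dst] = std::max(regs[src], min_val);  // plays the role of fmaxnm(dst, src, aux0)
    return regs[src] < min_val;                // later use of src: wrong when src aliases dst
}

// Safe: clamps into an auxiliary register, finishes all reads of src, then writes dst.
inline bool clamp_and_flag_safe(RegFile& regs, std::size_t src, std::size_t dst,
                                std::size_t aux, float min_val) {
    assert(aux != src && aux != dst);          // aux must be a distinct register
    regs[aux] = std::max(regs[src], min_val);  // src stays untouched
    const bool below_min = regs[src] < min_val;
    regs[dst] = regs[aux];                     // dst is written only after the last read of src
    return below_min;
}
```

With src == dst and an input of -100 clamped to -87, the unsafe variant reports false (the original value is lost), while the safe variant reports true and still leaves -87 in dst.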

a-sidorova · 2025-03-04T06:43:10Z

src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_eltwise_emitters.cpp

+}
+
+size_t jit_softplus_emitter::get_aux_vecs_count() const {
+    return 8;


Since some of the aux vec registers are passed to exp_emitter, I think the value returned here should be computed from exp_emitter->get_aux_vecs_count(). If exp_emitter requires (or will require) more than 8 registers, some vector registers will be spilled (saved on the stack) during the exp_emitter->emit_code() call.

What do you think?

I initially thought of returning exp_emitter->get_aux_vecs_count()+4, since exp_emitter->get_aux_vecs_count() returns 4. I tried using more than 9 aux registers (a few reserved solely for exp_emitter's own manipulation), but it won't let me allocate that many. I noticed the same pattern in elu_emitter, where get_aux_vecs_count() returns max(exp_emitter->get_aux_vecs_count()+1ull, 2ull). I'm wondering if I can apply the same logic when allocating my auxiliary registers.

Like below,

const TReg vmm_aux0(aux_vec_idxs[0]);
const TReg vmm_aux1(aux_vec_idxs[1]);
const TReg vmm_aux2(aux_vec_idxs[2]);
const TReg vmm_aux3(aux_vec_idxs[3]);
const TReg vmm_aux4(aux_vec_idxs[exp_aux_count]);
const TReg vmm_aux5(aux_vec_idxs[exp_aux_count + 1]);
const TReg vmm_mask(aux_vec_idxs[exp_aux_count + 2]);
const TReg vmm_neg_mask(aux_vec_idxs[exp_aux_count + 3]);

Is this approach correct?
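One way to picture the idea of deriving the count from the nested emitter is the sketch below. It is an illustration under assumptions, not the plugin code: AuxLayout, softplus_aux_vecs_count, and split_aux are hypothetical names, and the layout (first exp_aux_count indices go to the nested exp emitter, the rest are used by SoftPlus itself) is one possible convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical split of the allocated auxiliary vector register indices.
struct AuxLayout {
    std::vector<std::size_t> exp_aux;  // indices handed to the nested exp emitter
    std::vector<std::size_t> own_aux;  // indices used by the outer emitter itself
};

// Total request = what the nested emitter needs + what the outer emitter needs,
// so a future change in the nested emitter is picked up automatically.
inline std::size_t softplus_aux_vecs_count(std::size_t exp_aux_count, std::size_t own_count) {
    return exp_aux_count + own_count;
}

// Splits the allocated indices at exp_aux_count instead of at a hard-coded position.
inline AuxLayout split_aux(const std::vector<std::size_t>& aux_vec_idxs, std::size_t exp_aux_count) {
    assert(aux_vec_idxs.size() >= exp_aux_count);
    const auto split = aux_vec_idxs.begin() + static_cast<std::ptrdiff_t>(exp_aux_count);
    AuxLayout layout;
    layout.exp_aux.assign(aux_vec_idxs.begin(), split);
    layout.own_aux.assign(split, aux_vec_idxs.end());
    return layout;
}
```

With exp_aux_count = 4 and four registers of its own, this requests 8 registers in total, which matches the hard-coded value under review while staying correct if the exp emitter's requirement changes.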

a-sidorova · 2025-03-04T06:48:27Z

src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_eltwise_emitters.cpp

+    h->fmul(vmm_dst.s, vmm_aux0.s, vmm_aux2.s);
+    h->fadd(vmm_dst.s, vmm_dst.s, vmm_aux4.s);


Could we use an FMA instruction (fmla) here?

Sure, I will modify it. Thanks
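For context, a scalar sketch of what the change means numerically. The helper names are hypothetical; std::fma is the standard C++ fused multiply-add. Note that fmla accumulates into its destination (dst = dst + a * b), so the addend has to be in the destination register before the instruction is issued.

```cpp
#include <cmath>

// Two-step form: the product is rounded once, then the sum is rounded again.
inline float mul_then_add(float a, float b, float c) {
    const float product = a * b;
    return product + c;
}

// Fused form: a * b + c is computed with a single rounding at the end.
inline float fused_mul_add(float a, float b, float c) {
    return std::fma(a, b, c);
}

// Accumulator form, mirroring how fmla updates its destination in place.
inline void accumulate_fma(float& acc, float a, float b) {
    acc = std::fma(a, b, acc);
}
```

Besides saving one instruction, the fused form avoids the intermediate rounding of the product.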

a-sidorova · 2025-03-04T06:53:35Z

src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_eltwise_emitters.cpp

+    const TReg vmm_mask(aux_vec_idxs[6]);
+    const TReg vmm_neg_mask(aux_vec_idxs[7]); // mask to indicate whether n is negative


Can we have only one mask register? I see that vmm_neg_mask is free (after L2778) when we initialize vmm_mask on L2782. It looks like we can reuse vmm_neg_mask in the part of the code that handles big values.

Sure, will update that

a-sidorova · 2025-03-04T06:57:19Z

src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_eltwise_emitters.cpp

+template <dnnl::impl::cpu::aarch64::cpu_isa_t isa>
+void jit_softplus_emitter::emit_isa(const std::vector<size_t>& in_vec_idxs,
+                                   const std::vector<size_t>& out_vec_idxs) const {
+    using TReg = typename dnnl::impl::cpu::aarch64::cpu_isa_traits<isa>::TReg;


Could you please add a brief description of the implemented logic? What is the formula?

The implemented logic splits the base formula softplus(x) = ln(1+e^x) = ln(1+e^r*2^n) into two separate formulas, one for positive x and one for negative x.

For positive x we use softplus(x) = n*ln2 + ln((2^-(n-1)+2e^r)/2) to compute the log approximation.

For negative x we use softplus(x) = ln(1+e^x) = ln(1+e^r*2^n) = ln2 + ln(1 + (e^r*2^(n-1) - 0.5)) to compute the log approximation.
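Spelled out step by step, assuming the usual range reduction x = n ln 2 + r (so that e^x = e^r 2^n), both forms follow from factoring a power of two out of the logarithm's argument:

```latex
\begin{aligned}
\text{positive } x:\quad
\ln\!\left(1 + e^{r} 2^{n}\right)
  &= \ln\!\left(2^{n}\left(2^{-n} + e^{r}\right)\right)
   = n\ln 2 + \ln\!\left(2^{-n} + e^{r}\right)
   = n\ln 2 + \ln\!\left(\frac{2^{-(n-1)} + 2e^{r}}{2}\right) \\[4pt]
\text{negative } x:\quad
\ln\!\left(1 + e^{r} 2^{n}\right)
  &= \ln\!\left(2\left(\tfrac{1}{2} + e^{r} 2^{n-1}\right)\right)
   = \ln 2 + \ln\!\left(1 + \left(e^{r} 2^{n-1} - \tfrac{1}{2}\right)\right)
\end{aligned}
```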

a-sidorova · 2025-03-04T07:02:39Z

src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_eltwise_emitters.cpp

+    h->fsub(vmm_aux0.s, vmm_aux2.s, vmm_aux4.s); // (e^r*2^(n-1) - 0.5)
+
+    // Log approximation of (1 + (e^r*2^(n-1) - 0.5))
+    h->ld1r(vmm_aux5.s, table_val2("log_pol6"));   


Could you please elaborate on why we have to calculate two log approximations?

The base formula is softplus(x) = ln(1+e^x) = ln(1+e^r*2^n), where n is the quotient and r is the remainder of x divided by ln2. There are two cases: one for positive values of n, i.e. x >= 0, and one for negative values of n, i.e. x < 0.

For positive values we arrive at the formula softplus(x) = n*ln2 + ln((2^-(n-1)+2e^r)/2). Here we can calculate the log approximation because for n > 0 the term 2^-(n-1) is small and the argument stays in the desirable range. The problem occurs when n is negative: 2^-(n-1) becomes very large, and the logarithm can no longer be approximated with the 8th-order polynomial.

For negative values we can use the formula softplus(x) = ln(1+e^x) = ln(1+e^r*2^n) directly. I have done the manipulations shown in the code comment to bring y in ln(1+y) into the range [-0.5, 0], where we can approximate accurately for negative n.
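The two-branch decomposition can be checked against the direct formula with a small scalar reference. This is only a sketch under assumptions: it works in double precision, takes n as floor(x / ln2), and calls std::log / std::log1p where the emitter uses a polynomial approximation; the function names are hypothetical.

```cpp
#include <cmath>

// Direct formula, used as the ground truth for comparison.
inline double softplus_reference(double x) {
    return std::log1p(std::exp(x));
}

// Two-branch form from the discussion above.
inline double softplus_split(double x) {
    const double ln2 = std::log(2.0);
    const double n = std::floor(x / ln2);  // quotient (assumption: floor-based reduction)
    const double r = x - n * ln2;          // remainder, in [0, ln2)
    const double er = std::exp(r);

    if (n >= 0.0) {
        // softplus(x) = n*ln2 + ln((2^-(n-1) + 2*e^r) / 2); 2^-(n-1) stays <= 2 here
        return n * ln2 + std::log((std::exp2(-(n - 1.0)) + 2.0 * er) / 2.0);
    }
    // softplus(x) = ln2 + ln(1 + y), where y = e^r * 2^(n-1) - 0.5 lies in (-0.5, 0)
    const double y = er * std::exp2(n - 1.0) - 0.5;
    return ln2 + std::log1p(y);
}
```

Evaluating both functions over a few positive and negative inputs shows them agreeing to within rounding error, which confirms that the split only changes the argument range of the logarithm and not the result.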

Implemented JIT SoftPlus Emitter for ARM64

009d0bc

srinjoydutta03 requested review from a team as code owners March 3, 2025 11:33

github-actions bot added the category: CPU OpenVINO CPU plugin label Mar 3, 2025

sys-openvino-ci added the ExternalPR External contributor label Mar 3, 2025

a-sidorova self-assigned this Mar 3, 2025

a-sidorova added this to the 2025.1 milestone Mar 3, 2025

a-sidorova added the platform: arm OpenVINO on ARM / ARM64 label Mar 3, 2025

a-sidorova approved these changes Mar 4, 2025
