
changes based on Issue #28731 ((System.Numerics) Cross Product for Vector2 and Vector4) #31884

Closed
wants to merge 1 commit

Conversation

micampbell

changes based on Issue #28731 ((System.Numerics) Cross Product for Vector2 and Vector4). Added a Cross (product) function for Vector2 and Vector4.
@micampbell
Author

please see discussion at #28731

@saucecontrol
Member

[Intrinsic] indicates that the JIT has a special implementation of a method it can emit in place of the managed version. For example, the SIMD versions of Dot Product are implemented in the JIT here:

//--------------------------------------------------------------------------------
// genSIMDIntrinsicDotProduct: Generate code for SIMD Intrinsic Dot Product.
//
// Arguments:
// simdNode - The GT_SIMD node
//
// Return Value:
// None.
//
void CodeGen::genSIMDIntrinsicDotProduct(GenTreeSIMD* simdNode)
{
    assert(simdNode->gtSIMDIntrinsicID == SIMDIntrinsicDotProduct);

    GenTree*  op1      = simdNode->gtGetOp1();
    GenTree*  op2      = simdNode->gtGetOp2();
    var_types baseType = simdNode->gtSIMDBaseType;
    var_types simdType = op1->TypeGet();

    // TODO-1stClassStructs: Temporary to minimize asmDiffs
    if (simdType == TYP_DOUBLE)
    {
        simdType = TYP_SIMD8;
    }
    var_types simdEvalType = (simdType == TYP_SIMD12) ? TYP_SIMD16 : simdType;
    regNumber targetReg    = simdNode->GetRegNum();
    assert(targetReg != REG_NA);
    var_types targetType = simdNode->TypeGet();
    assert(targetType == baseType);

    genConsumeOperands(simdNode);
    regNumber op1Reg  = op1->GetRegNum();
    regNumber op2Reg  = op2->GetRegNum();
    regNumber tmpReg1 = REG_NA;
    regNumber tmpReg2 = REG_NA;

    SIMDLevel level = compiler->getSIMDSupportLevel();

    // Dot product intrinsic is supported only on float/double vectors
    // and 32-byte int vectors on AVX.
    //
    // Float/Double Vectors:
    // For SSE, or AVX with 32-byte vectors, we need one additional Xmm register
    // different from targetReg as scratch. Note that if this is a TYP_SIMD16 or
    // smaller on AVX, then we don't need a tmpReg.
    //
    // 32-byte integer vector on AVX: we need two additional Xmm registers
    // different from targetReg as scratch.
    //
    // 16-byte integer vector on SSE4: we need one additional Xmm register
    // different from targetReg as scratch.
    if (varTypeIsFloating(baseType))
    {
        if ((compiler->getSIMDSupportLevel() == SIMD_SSE2_Supported) || (simdEvalType == TYP_SIMD32))
        {
            tmpReg1 = simdNode->GetSingleTempReg();
            assert(tmpReg1 != targetReg);
        }
        else
        {
            assert(simdNode->AvailableTempRegCount() == 0);
        }
    }
    else
    {
        assert(baseType == TYP_INT);
        assert(level >= SIMD_SSE4_Supported);

        if (level == SIMD_SSE4_Supported)
        {
            tmpReg1 = simdNode->GetSingleTempReg();
        }
        else
        {
            tmpReg1 = simdNode->ExtractTempReg();
            tmpReg2 = simdNode->GetSingleTempReg();
        }
    }

    if (level == SIMD_SSE2_Supported)
    {
        // We avoid reg move if either op1Reg == targetReg or op2Reg == targetReg
        if (op1Reg == targetReg)
        {
            // Best case
            // nothing to do, we have registers in the right place
        }
        else if (op2Reg == targetReg)
        {
            op2Reg = op1Reg;
        }
        else
        {
            inst_RV_RV(ins_Copy(simdType), targetReg, op1Reg, simdEvalType, emitActualTypeSize(simdType));
        }

        // DotProduct(v1, v2)
        // Here v0 = targetReg, v1 = op1Reg, v2 = op2Reg and tmp = tmpReg1
        if ((simdNode->gtFlags & GTF_SIMD12_OP) != 0)
        {
            assert(baseType == TYP_FLOAT);
            // v0 = v1 * v2
            // tmp = v0                               // v0  = (3, 2, 1, 0) - each element is given by its position
            // tmp = shuffle(tmp, tmp, SHUFFLE_ZXXY)  // tmp = (2, 0, 0, 1) - don't really care what's in upper bits
            // v0 = v0 + tmp                          // v0  = (3+2, 0+2, 1+0, 0+1)
            // tmp = shuffle(tmp, tmp, SHUFFLE_XXWW)  // tmp = ( 1,  1,  2,  2)
            // v0 = v0 + tmp                          // v0  = (1+2+3, 0+1+2, 0+1+2, 0+1+2)
            //
            inst_RV_RV(INS_mulps, targetReg, op2Reg);
            inst_RV_RV(INS_movaps, tmpReg1, targetReg);
            inst_RV_RV_IV(INS_shufps, EA_16BYTE, tmpReg1, tmpReg1, (int8_t)SHUFFLE_ZXXY);
            inst_RV_RV(INS_addps, targetReg, tmpReg1);
            inst_RV_RV_IV(INS_shufps, EA_16BYTE, tmpReg1, tmpReg1, (int8_t)SHUFFLE_XXWW);
            inst_RV_RV(INS_addps, targetReg, tmpReg1);
        }
        else if (baseType == TYP_FLOAT)
        {
            // v0 = v1 * v2
            // tmp = v0                               // v0  = (3, 2, 1, 0) - each element is given by its position
            // tmp = shuffle(tmp, tmp, SHUFFLE_ZWXY)  // tmp = (2, 3, 0, 1)
            // v0 = v0 + tmp                          // v0  = (3+2, 2+3, 1+0, 0+1)
            // tmp = v0
            // tmp = shuffle(tmp, tmp, SHUFFLE_XYZW)  // tmp = (0+1, 1+0, 2+3, 3+2)
            // v0 = v0 + tmp                          // v0  = (0+1+2+3, 0+1+2+3, 0+1+2+3, 0+1+2+3)
            //                                        // Essentially horizontal addition of all elements.
            //                                        // We could achieve the same using the SSE3 instruction
            //                                        // HADDPS.
            //
            inst_RV_RV(INS_mulps, targetReg, op2Reg);
            inst_RV_RV(INS_movaps, tmpReg1, targetReg);
            inst_RV_RV_IV(INS_shufps, EA_16BYTE, tmpReg1, tmpReg1, (int8_t)SHUFFLE_ZWXY);
            inst_RV_RV(INS_addps, targetReg, tmpReg1);
            inst_RV_RV(INS_movaps, tmpReg1, targetReg);
            inst_RV_RV_IV(INS_shufps, EA_16BYTE, tmpReg1, tmpReg1, (int8_t)SHUFFLE_XYZW);
            inst_RV_RV(INS_addps, targetReg, tmpReg1);
        }
        else
        {
            assert(baseType == TYP_DOUBLE);
            // v0 = v1 * v2
            // tmp = v0                               // v0  = (1, 0) - each element is given by its position
            // tmp = shuffle(tmp, tmp, Shuffle(0,1))  // tmp = (0, 1)
            // v0 = v0 + tmp                          // v0  = (1+0, 0+1)
            inst_RV_RV(INS_mulpd, targetReg, op2Reg);
            inst_RV_RV(INS_movaps, tmpReg1, targetReg);
            inst_RV_RV_IV(INS_shufpd, EA_16BYTE, tmpReg1, tmpReg1, 0x01);
            inst_RV_RV(INS_addpd, targetReg, tmpReg1);
        }
    }
    else
    {
        assert(level >= SIMD_SSE4_Supported);

        if (varTypeIsFloating(baseType))
        {
            // We avoid reg move if either op1Reg == targetReg or op2Reg == targetReg.
            // Note that this is a duplicate of the code above for SSE, but in the AVX case we can eventually
            // use the 3-op form, so that we can avoid these copies.
            // TODO-CQ: Add inst_RV_RV_RV_IV().
            if (op1Reg == targetReg)
            {
                // Best case
                // nothing to do, we have registers in the right place
            }
            else if (op2Reg == targetReg)
            {
                op2Reg = op1Reg;
            }
            else
            {
                inst_RV_RV(ins_Copy(simdType), targetReg, op1Reg, simdEvalType, emitActualTypeSize(simdType));
            }

            emitAttr emitSize = emitActualTypeSize(simdEvalType);
            if (baseType == TYP_FLOAT)
            {
                // dpps computes the dot product of the upper & lower halves of the 32-byte register.
                // Notice that if this is a TYP_SIMD16 or smaller on AVX, then we don't need a tmpReg.
                unsigned mask = ((simdNode->gtFlags & GTF_SIMD12_OP) != 0) ? 0x71 : 0xf1;
                assert((mask >= 0) && (mask <= 255));
                inst_RV_RV_IV(INS_dpps, emitSize, targetReg, op2Reg, (int8_t)mask);

                // If this is TYP_SIMD32, we need to combine the lower & upper results.
                if (simdEvalType == TYP_SIMD32)
                {
                    GetEmitter()->emitIns_R_R_I(INS_vextractf128, EA_32BYTE, tmpReg1, targetReg, 0x01);
                    inst_RV_RV(INS_addps, targetReg, tmpReg1, targetType, emitTypeSize(targetType));
                }
            }
            else if (baseType == TYP_DOUBLE)
            {
                if (simdEvalType == TYP_SIMD32)
                {
                    // targetReg = targetReg * op2Reg
                    // targetReg = vhaddpd(targetReg, targetReg)  ; horizontal sum of lower & upper halves
                    // tmpReg1   = vextractf128(targetReg, 1)     ; moves the upper sum into tmpReg1
                    // targetReg = targetReg + tmpReg1
                    inst_RV_RV(INS_mulpd, targetReg, op2Reg, simdEvalType, emitActualTypeSize(simdType));
                    inst_RV_RV(INS_haddpd, targetReg, targetReg, simdEvalType, emitActualTypeSize(simdType));
                    GetEmitter()->emitIns_R_R_I(INS_vextractf128, EA_32BYTE, tmpReg1, targetReg, 0x01);
                    inst_RV_RV(INS_addpd, targetReg, tmpReg1, targetType, emitTypeSize(targetType));
                }
                else
                {
                    // On AVX, we have no 16-byte vectors of double. Note that, if we did, we could use
                    // dppd directly.
                    assert(level == SIMD_SSE4_Supported);
                    inst_RV_RV_IV(INS_dppd, emitSize, targetReg, op2Reg, 0x31);
                }
            }
        }
        else
        {
            // Dot product of 32-byte int vector on SSE4/AVX.
            assert(baseType == TYP_INT);
            assert(simdEvalType == TYP_SIMD16 || simdEvalType == TYP_SIMD32);

#ifdef DEBUG
            // SSE4: We need 1 scratch register.
            // AVX2: We need 2 scratch registers.
            if (simdEvalType == TYP_SIMD16)
            {
                assert(tmpReg1 != REG_NA);
            }
            else
            {
                assert(tmpReg1 != REG_NA);
                assert(tmpReg2 != REG_NA);
            }
#endif

            // tmpReg1 = op1 * op2
            if (level == SIMD_AVX2_Supported)
            {
                // On AVX, take advantage of the 3-operand form of pmulld
                inst_RV_RV_RV(INS_pmulld, tmpReg1, op1Reg, op2Reg, emitTypeSize(simdEvalType));
            }
            else
            {
                inst_RV_RV(ins_Copy(simdEvalType), tmpReg1, op1Reg, simdEvalType);
                inst_RV_RV(INS_pmulld, tmpReg1, op2Reg, simdEvalType);
            }

            if (simdEvalType == TYP_SIMD32)
            {
                // tmpReg2[127..0] = upper 128 bits of tmpReg1
                GetEmitter()->emitIns_R_R_I(INS_vextractf128, EA_32BYTE, tmpReg2, tmpReg1, 0x01);

                // tmpReg1[127..0] = tmpReg1[127..0] + tmpReg2[127..0]
                // This will compute
                //    tmpReg1[0] = op1[0]*op2[0] + op1[4]*op2[4]
                //    tmpReg1[1] = op1[1]*op2[1] + op1[5]*op2[5]
                //    tmpReg1[2] = op1[2]*op2[2] + op1[6]*op2[6]
                //    tmpReg1[3] = op1[3]*op2[3] + op1[7]*op2[7]
                inst_RV_RV(INS_paddd, tmpReg1, tmpReg2, TYP_SIMD16, EA_16BYTE);
            }

            // This horizontal add will compute
            //
            // TYP_SIMD16:
            //    tmpReg1[0] = tmpReg1[2] = op1[0]*op2[0] + op1[1]*op2[1]
            //    tmpReg1[1] = tmpReg1[3] = op1[2]*op2[2] + op1[3]*op2[3]
            //
            // TYP_SIMD32:
            //    tmpReg1[0] = tmpReg1[2] = op1[0]*op2[0] + op1[4]*op2[4] + op1[1]*op2[1] + op1[5]*op2[5]
            //    tmpReg1[1] = tmpReg1[3] = op1[2]*op2[2] + op1[6]*op2[6] + op1[3]*op2[3] + op1[7]*op2[7]
            inst_RV_RV(INS_phaddd, tmpReg1, tmpReg1, TYP_SIMD16, EA_16BYTE);

            // DotProduct(op1, op2) = tmpReg1[0] = tmpReg1[0] + tmpReg1[1]
            inst_RV_RV(INS_phaddd, tmpReg1, tmpReg1, TYP_SIMD16, EA_16BYTE);

            // targetReg = integer result from tmpReg1
            // (Note that for mov_xmm2i, the int register is always in the reg2 position)
            inst_RV_RV(INS_mov_xmm2i, tmpReg1, targetReg, TYP_INT);
        }
    }

    genProduceReg(simdNode);
}

It's not likely you can improve on Vector2.Cross with SIMD, so you'll probably want to just move that as-is into Vector2.cs and remove the [Intrinsic] attribute.

For Vector4.Cross, a SIMD version would be faster, but it probably won't be necessary to make it a JIT intrinsic once #952 is done (#1280 covered part of this).

Once you can cast Vector4 to Vector128<float> and back for free, you'll be able to write the SIMD version using System.Runtime.Intrinsics with a scalar fallback in the same way some of the Matrix4x4 methods are implemented now. e.g.

/// <summary>
/// Transposes the rows and columns of a matrix.
/// </summary>
/// <param name="matrix">The source matrix.</param>
/// <returns>The transposed matrix.</returns>
public static unsafe Matrix4x4 Transpose(Matrix4x4 matrix)
{
    if (Sse.IsSupported)
    {
        var row1 = Sse.LoadVector128(&matrix.M11);
        var row2 = Sse.LoadVector128(&matrix.M21);
        var row3 = Sse.LoadVector128(&matrix.M31);
        var row4 = Sse.LoadVector128(&matrix.M41);

        var l12 = Sse.UnpackLow(row1, row2);
        var l34 = Sse.UnpackLow(row3, row4);
        var h12 = Sse.UnpackHigh(row1, row2);
        var h34 = Sse.UnpackHigh(row3, row4);

        Sse.Store(&matrix.M11, Sse.MoveLowToHigh(l12, l34));
        Sse.Store(&matrix.M21, Sse.MoveHighToLow(l34, l12));
        Sse.Store(&matrix.M31, Sse.MoveLowToHigh(h12, h34));
        Sse.Store(&matrix.M41, Sse.MoveHighToLow(h34, h12));

        return matrix;
    }

    Matrix4x4 result;

    result.M11 = matrix.M11;
    result.M12 = matrix.M21;
    result.M13 = matrix.M31;
    result.M14 = matrix.M41;
    result.M21 = matrix.M12;
    result.M22 = matrix.M22;
    result.M23 = matrix.M32;
    result.M24 = matrix.M42;
    result.M31 = matrix.M13;
    result.M32 = matrix.M23;
    result.M33 = matrix.M33;
    result.M34 = matrix.M43;
    result.M41 = matrix.M14;
    result.M42 = matrix.M24;
    result.M43 = matrix.M34;
    result.M44 = matrix.M44;

    return result;
}

That SIMD version might be able to wait for a followup PR, so you could move that one as-is to Vector4.cs and remove the [Intrinsic] attribute from that as well.

@micampbell
Author

micampbell commented Feb 24, 2020

[Intrinsic] indicates that the JIT has a special implementation of a method it can emit in place of the managed version. For example, the SIMD versions of Dot Product are implemented in the JIT here:

Thanks for the info. I wasn't really understanding [Intrinsic] until now. The C++ code for dot product is beyond me, but it leads me to a new idea. In the vector and matrix classes there are various methods that invoke simple math operations: a * b + c * d (like a dot product) or a * b - c * d. Actually, the latter "difference of products" is really what we see a lot in GetDeterminant, Invert, and Cross. It would seem a special SIMD function computing a * b - c * d could be used to great effect.

@saucecontrol
Member

saucecontrol commented Feb 24, 2020

Dot product is a no-brainer for SIMD acceleration because SSE4.1 has a single instruction (DPPS) that computes a dot product on 1-4 elements of 2 float vectors. AVX further expands that with VDPPS, which can compute two dot products in a single instruction. There is no equivalent instruction for the difference of products.

The JIT code has fallback implementations to use other SIMD instructions when SSE4.1 is not available or when the element type is not float. It works out that the SIMD implementation is actually less efficient than the scalar implementation for Vector2 when the SSE fallback is used:

inst_RV_RV(INS_mulps, targetReg, op2Reg);
inst_RV_RV(INS_movaps, tmpReg1, targetReg);
inst_RV_RV_IV(INS_shufps, EA_16BYTE, tmpReg1, tmpReg1, (int8_t)SHUFFLE_ZWXY);
inst_RV_RV(INS_addps, targetReg, tmpReg1);
inst_RV_RV(INS_movaps, tmpReg1, targetReg);
inst_RV_RV_IV(INS_shufps, EA_16BYTE, tmpReg1, tmpReg1, (int8_t)SHUFFLE_XYZW);
inst_RV_RV(INS_addps, targetReg, tmpReg1);

However, because maintenance of the JIT code is very expensive relative to maintenance of the managed code, the work has not been done to optimize for every case.

The ability to use the individual SIMD instructions from System.Runtime.Intrinsics does open up the possibility of optimizing more areas of System.Numerics without the cost of writing new higher-level intrinsics, and it's likely best to approach any 'difference of products' optimizations that way.

@danmoseley
Member

@tannergooding what's the next action here? Should we close this pending the issue ref'd above?

@tannergooding
Member

The referenced issue is resolved and shouldn't block optimizing the Vector4 implementation. It's only Vector2/3 where the conversion is not quite zero-cost.

@micampbell
Author

I'm sorry for my delay in responding; I've not had the bandwidth or resources during quarantine. In re-reading @saucecontrol's posts, I'm still wondering whether the SSE4.1 Dot implementation could be used to speed up functions like Cross, GetDeterminant, and Invert. One could just call Dot in the body of these functions. Would it be faster? For example:

        public static double Cross(Vector2 value1, Vector2 value2)
        {
            //instead of return value1.X * value2.Y - value1.Y * value2.X;
            return Dot(value1, new Vector2(value2.Y, -value2.X));
        }

@tannergooding
Member

I wouldn't expect so for Vector2; the actual cost of building the new vector is likely greater than the mul, mul, sub that would otherwise exist.

@stephentoub
Member

@micampbell, are you still working on this, or should it be closed for now? Thanks.

@jeffhandley
Member

@tannergooding @pgovind Is this something you'd like to pick up and finish, or should we go ahead and close it?

@tannergooding
Member

I think this can be closed for now and the original issue marked up-for-grabs again, possibly with revisiting the names used.
In particular, there was some discussion around what this is doing for the 4D cross product vs. what other libraries, such as DirectXMath, do.

If you take a cross product to be defined by the property that Cross(u, v) is always perpendicular to both u and v, and extrapolate from that, you will eventually conclude that the cross product of n-dimensional vectors requires n-1 inputs.

Vector2

This means that a "true" 2-dimensional cross product takes one input and is effectively (Y, -X).
The operation being exposed here, in DXMath, and commonly exposed for 2D graphics is actually the wedge product: https://mathworld.wolfram.com/WedgeProduct.html

Vector4

DXMath exposes a four-dimensional cross product as taking 3 inputs so it can compute a truly perpendicular Vector4 result.
What is defined here is more akin to computing the 3D cross product and additionally multiplying the W components together. While that makes sense for various scenarios, I'm no longer convinced "cross product" is what it should be called (especially when compared to what other similar math libraries expose/name).

@ghost ghost locked as resolved and limited conversation to collaborators Feb 6, 2021