
[VPlan] Perform interleaving as VPlan-to-VPlan transform. #94339

Closed
wants to merge 1 commit

Conversation

@fhahn (Contributor) commented Jun 4, 2024

This patch implements explicit interleaving as a VPlan transform, thus
simplifying VPTransformState (no need to store unrolled parts) as well
as recipe execution (no need to generate code for multiple parts in
each recipe). It also allows for more general optimizations (e.g.
avoiding code generation for recipes that are uniform across parts).
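
As a rough illustration of the core idea (a hedged sketch, not the patch's actual code: remapOperandsForPart is an invented helper, and the real transform additionally handles header phis, regions, and part-specific operands), the transform conceptually clones every recipe once per extra unrolled part:

#include "VPlan.h"              // internal header in llvm/lib/Transforms/Vectorize
#include "llvm/ADT/STLExtras.h" // make_early_inc_range

using namespace llvm;

// Invented helper: rewire operands that must refer to per-part values.
static void remapOperandsForPart(VPRecipeBase *Copy, unsigned Part);

// Clone each recipe UF-1 times so every unrolled part becomes an explicit
// recipe in the plan, instead of being re-executed implicitly during codegen.
static void unrollBlockByUF(VPBasicBlock *VPBB, unsigned UF) {
  for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
    VPRecipeBase *InsertPt = &R;
    for (unsigned Part = 1; Part < UF; ++Part) {
      VPRecipeBase *Copy = R.clone();
      Copy->insertAfter(InsertPt); // keep parts in order after the original
      InsertPt = Copy;
      remapOperandsForPart(Copy, Part);
    }
  }
}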

In the initial implementation, a number of recipes still take the
unrolled part as an additional, optional argument if their execution
depends on the unrolled part.

The computation of start/step values for scalable inductions changed
slightly. Previously the step was computed as a scalar and then
splatted; now vscale is splatted and multiplied by the step in a
vector mul.
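
For illustration, here is a minimal IRBuilder sketch of the two lowerings (a sketch under stated assumptions, not the patch's code: the function name, the i64 step type, and the fixed <vscale x 4> VF are invented for the example):

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Build the vector step for a scalable induction from the scalar step Step.
static Value *buildStepVector(IRBuilder<> &B, Value *Step, bool OldScheme) {
  ElementCount VF = ElementCount::getScalable(4); // <vscale x 4 x i64>
  Value *VScale = B.CreateVScale(ConstantInt::get(B.getInt64Ty(), 1));
  if (OldScheme) {
    // Previously: multiply vscale and the step as scalars, splat the result.
    Value *ScalarStep = B.CreateMul(VScale, Step);
    return B.CreateVectorSplat(VF, ScalarStep);
  }
  // Now: splat vscale and the step separately, then combine in a vector mul.
  Value *VScaleSplat = B.CreateVectorSplat(VF, VScale);
  Value *StepSplat = B.CreateVectorSplat(VF, Step);
  return B.CreateMul(VScaleSplat, StepSplat, "vector.mul");
}

Both forms yield step * vscale in every lane; only the point at which the multiply happens (scalar vs. vector) differs.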

Depends on #93396.

@llvmbot (Collaborator) commented Jun 4, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch implements explicit interleaving as a VPlan transform, thus
simplifying VPTransformState (no need to store unrolled parts) as well
as recipe execution (no need to generate code for multiple parts in
each recipe). It also allows for more general optimizations (e.g.
avoiding code generation for recipes that are uniform across parts).

In the initial implementation, a number of recipes still take the
unrolled part as an additional, optional argument if their execution
depends on the unrolled part.

The computation of start/step values for scalable inductions changed
slightly. Previously the step was computed as a scalar and then
splatted; now vscale is splatted and multiplied by the step in a
vector mul.

Depends on #93396.


Patch is 307.22 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/94339.diff

43 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+9)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+184-277)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+84-86)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+70-56)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.h (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+364-431)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+440)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/fixed-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-gather-scatter.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-live-out-pointer-induction.ll (+4-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-phi.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-accesses.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll (-3)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-interleave.ll (+6-7)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/epilog-vectorization-inductions.ll (+5-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/interleaving.ll (+34-34)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr47437.ll (+24-24)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll (+12-21)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains-vplan.ll (+12)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll (+15-15)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-complex.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+17)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+66-72)
  • (modified) llvm/test/Transforms/LoopVectorize/float-induction.ll (+13-15)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+44-43)
  • (modified) llvm/test/Transforms/LoopVectorize/interleave-and-scalarize-only.ll (+3-2)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/pr45679-fold-tail-by-masking.ll (+12-12)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-inloop-uf4.ll (+60-60)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-inductions.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-before-execute.ll (+13-6)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+2)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index c03c278fcebe7..61fc02cdf9b8b 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -161,6 +161,15 @@ class VPBuilder {
     return tryInsertInstruction(
         new VPInstruction(Opcode, Operands, WrapFlags, DL, Name));
   }
+
+  VPInstruction *createFPOp(unsigned Opcode,
+                            std::initializer_list<VPValue *> Operands,
+                            DebugLoc DL = {}, const Twine &Name = "",
+                            FastMathFlags FMFs = {}) {
+    auto *Op = new VPInstruction(Opcode, Operands, FMFs, DL, Name);
+    return tryInsertInstruction(Op);
+  }
+
   VPValue *createNot(VPValue *Operand, DebugLoc DL = {},
                      const Twine &Name = "") {
     return createInstruction(VPInstruction::Not, {Operand}, DL, Name);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 188bfc164f30a..8f21c42ee6e0e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -551,8 +551,7 @@ class InnerLoopVectorizer {
   /// inclusive. Uses the VPValue operands from \p RepRecipe instead of \p
   /// Instr's operands.
   void scalarizeInstruction(const Instruction *Instr,
-                            VPReplicateRecipe *RepRecipe,
-                            const VPIteration &Instance,
+                            VPReplicateRecipe *RepRecipe, const VPLane &Lane,
                             VPTransformState &State);
 
   /// Try to vectorize interleaved access group \p Group with the base address
@@ -608,8 +607,7 @@ class InnerLoopVectorizer {
 
   /// Create the exit value of first order recurrences in the middle block and
   /// update their users.
-  void fixFixedOrderRecurrence(VPFirstOrderRecurrencePHIRecipe *PhiR,
-                               VPTransformState &State);
+  void fixFixedOrderRecurrence(VPLiveOut *LO, VPTransformState &State);
 
   /// Iteratively sink the scalarized operands of a predicated instruction into
   /// the block that was created for it.
@@ -2451,7 +2449,6 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
   auto *VecTy = VectorType::get(ScalarTy, VF * InterleaveFactor);
 
   // Prepare for the new pointers.
-  SmallVector<Value *, 2> AddrParts;
   unsigned Index = Group->getIndex(Instr);
 
   // TODO: extend the masked interleaved-group support to reversed access.
@@ -2474,40 +2471,37 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
   } else
     Idx = Builder.getInt32(-Index);
 
-  for (unsigned Part = 0; Part < UF; Part++) {
-    Value *AddrPart = State.get(Addr, VPIteration(Part, 0));
-    if (auto *I = dyn_cast<Instruction>(AddrPart))
-      State.setDebugLocFrom(I->getDebugLoc());
+  Value *AddrV = State.get(Addr, VPLane(0));
+  if (auto *I = dyn_cast<Instruction>(AddrV))
+    State.setDebugLocFrom(I->getDebugLoc());
 
-    // Notice current instruction could be any index. Need to adjust the address
-    // to the member of index 0.
-    //
-    // E.g.  a = A[i+1];     // Member of index 1 (Current instruction)
-    //       b = A[i];       // Member of index 0
-    // Current pointer is pointed to A[i+1], adjust it to A[i].
-    //
-    // E.g.  A[i+1] = a;     // Member of index 1
-    //       A[i]   = b;     // Member of index 0
-    //       A[i+2] = c;     // Member of index 2 (Current instruction)
-    // Current pointer is pointed to A[i+2], adjust it to A[i].
+  // Notice current instruction could be any index. Need to adjust the address
+  // to the member of index 0.
+  //
+  // E.g.  a = A[i+1];     // Member of index 1 (Current instruction)
+  //       b = A[i];       // Member of index 0
+  // Current pointer is pointed to A[i+1], adjust it to A[i].
+  //
+  // E.g.  A[i+1] = a;     // Member of index 1
+  //       A[i]   = b;     // Member of index 0
+  //       A[i+2] = c;     // Member of index 2 (Current instruction)
+  // Current pointer is pointed to A[i+2], adjust it to A[i].
 
-    bool InBounds = false;
-    if (auto *gep = dyn_cast<GetElementPtrInst>(AddrPart->stripPointerCasts()))
-      InBounds = gep->isInBounds();
-    AddrPart = Builder.CreateGEP(ScalarTy, AddrPart, Idx, "", InBounds);
-    AddrParts.push_back(AddrPart);
-  }
+  bool InBounds = false;
+  if (auto *gep = dyn_cast<GetElementPtrInst>(AddrV->stripPointerCasts()))
+    InBounds = gep->isInBounds();
+  AddrV = Builder.CreateGEP(ScalarTy, AddrV, Idx, "", InBounds);
 
   State.setDebugLocFrom(Instr->getDebugLoc());
   Value *PoisonVec = PoisonValue::get(VecTy);
 
-  auto CreateGroupMask = [this, &BlockInMask, &State, &InterleaveFactor](
-                             unsigned Part, Value *MaskForGaps) -> Value * {
+  auto CreateGroupMask = [this, &BlockInMask, &State,
+                          &InterleaveFactor](Value *MaskForGaps) -> Value * {
     if (VF.isScalable()) {
       assert(!MaskForGaps && "Interleaved groups with gaps are not supported.");
       assert(InterleaveFactor == 2 &&
              "Unsupported deinterleave factor for scalable vectors");
-      auto *BlockInMaskPart = State.get(BlockInMask, Part);
+      auto *BlockInMaskPart = State.get(BlockInMask);
       SmallVector<Value *, 2> Ops = {BlockInMaskPart, BlockInMaskPart};
       auto *MaskTy =
           VectorType::get(Builder.getInt1Ty(), VF.getKnownMinValue() * 2, true);
@@ -2518,7 +2512,7 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
     if (!BlockInMask)
       return MaskForGaps;
 
-    Value *BlockInMaskPart = State.get(BlockInMask, Part);
+    Value *BlockInMaskPart = State.get(BlockInMask);
     Value *ShuffledMask = Builder.CreateShuffleVector(
         BlockInMaskPart,
         createReplicatedMask(InterleaveFactor, VF.getKnownMinValue()),
@@ -2538,54 +2532,47 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
     }
 
     // For each unroll part, create a wide load for the group.
-    SmallVector<Value *, 2> NewLoads;
-    for (unsigned Part = 0; Part < UF; Part++) {
-      Instruction *NewLoad;
-      if (BlockInMask || MaskForGaps) {
-        assert(useMaskedInterleavedAccesses(*TTI) &&
-               "masked interleaved groups are not allowed.");
-        Value *GroupMask = CreateGroupMask(Part, MaskForGaps);
-        NewLoad =
-            Builder.CreateMaskedLoad(VecTy, AddrParts[Part], Group->getAlign(),
-                                     GroupMask, PoisonVec, "wide.masked.vec");
-      }
-      else
-        NewLoad = Builder.CreateAlignedLoad(VecTy, AddrParts[Part],
-                                            Group->getAlign(), "wide.vec");
-      Group->addMetadata(NewLoad);
-      NewLoads.push_back(NewLoad);
-    }
+    Instruction *NewLoad;
+    if (BlockInMask || MaskForGaps) {
+      assert(useMaskedInterleavedAccesses(*TTI) &&
+             "masked interleaved groups are not allowed.");
+      Value *GroupMask = CreateGroupMask(MaskForGaps);
+      NewLoad =
+          Builder.CreateMaskedLoad(VecTy, AddrV, Group->getAlign(), GroupMask,
+                                   PoisonVec, "wide.masked.vec");
+    } else
+      NewLoad = Builder.CreateAlignedLoad(VecTy, AddrV, Group->getAlign(),
+                                          "wide.vec");
+    Group->addMetadata(NewLoad);
 
     if (VecTy->isScalableTy()) {
       assert(InterleaveFactor == 2 &&
              "Unsupported deinterleave factor for scalable vectors");
 
-      for (unsigned Part = 0; Part < UF; ++Part) {
         // Scalable vectors cannot use arbitrary shufflevectors (only splats),
         // so must use intrinsics to deinterleave.
-        Value *DI = Builder.CreateIntrinsic(
-            Intrinsic::vector_deinterleave2, VecTy, NewLoads[Part],
-            /*FMFSource=*/nullptr, "strided.vec");
-        unsigned J = 0;
-        for (unsigned I = 0; I < InterleaveFactor; ++I) {
-          Instruction *Member = Group->getMember(I);
-
-          if (!Member)
-            continue;
+      Value *DI = Builder.CreateIntrinsic(Intrinsic::vector_deinterleave2,
+                                          VecTy, NewLoad,
+                                          /*FMFSource=*/nullptr, "strided.vec");
+      unsigned J = 0;
+      for (unsigned I = 0; I < InterleaveFactor; ++I) {
+        Instruction *Member = Group->getMember(I);
+
+        if (!Member)
+          continue;
 
-          Value *StridedVec = Builder.CreateExtractValue(DI, I);
-          // If this member has different type, cast the result type.
-          if (Member->getType() != ScalarTy) {
-            VectorType *OtherVTy = VectorType::get(Member->getType(), VF);
-            StridedVec = createBitOrPointerCast(StridedVec, OtherVTy, DL);
-          }
+        Value *StridedVec = Builder.CreateExtractValue(DI, I);
+        // If this member has different type, cast the result type.
+        if (Member->getType() != ScalarTy) {
+          VectorType *OtherVTy = VectorType::get(Member->getType(), VF);
+          StridedVec = createBitOrPointerCast(StridedVec, OtherVTy, DL);
+        }
 
-          if (Group->isReverse())
-            StridedVec = Builder.CreateVectorReverse(StridedVec, "reverse");
+        if (Group->isReverse())
+          StridedVec = Builder.CreateVectorReverse(StridedVec, "reverse");
 
-          State.set(VPDefs[J], StridedVec, Part);
-          ++J;
-        }
+        State.set(VPDefs[J], StridedVec);
+        ++J;
       }
 
       return;
@@ -2603,22 +2590,20 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
 
       auto StrideMask =
           createStrideMask(I, InterleaveFactor, VF.getKnownMinValue());
-      for (unsigned Part = 0; Part < UF; Part++) {
-        Value *StridedVec = Builder.CreateShuffleVector(
-            NewLoads[Part], StrideMask, "strided.vec");
-
-        // If this member has different type, cast the result type.
-        if (Member->getType() != ScalarTy) {
-          assert(!VF.isScalable() && "VF is assumed to be non scalable.");
-          VectorType *OtherVTy = VectorType::get(Member->getType(), VF);
-          StridedVec = createBitOrPointerCast(StridedVec, OtherVTy, DL);
-        }
+      Value *StridedVec =
+          Builder.CreateShuffleVector(NewLoad, StrideMask, "strided.vec");
+
+      // If this member has different type, cast the result type.
+      if (Member->getType() != ScalarTy) {
+        assert(!VF.isScalable() && "VF is assumed to be non scalable.");
+        VectorType *OtherVTy = VectorType::get(Member->getType(), VF);
+        StridedVec = createBitOrPointerCast(StridedVec, OtherVTy, DL);
+      }
 
-        if (Group->isReverse())
-          StridedVec = Builder.CreateVectorReverse(StridedVec, "reverse");
+      if (Group->isReverse())
+        StridedVec = Builder.CreateVectorReverse(StridedVec, "reverse");
 
-        State.set(VPDefs[J], StridedVec, Part);
-      }
+      State.set(VPDefs[J], StridedVec);
       ++J;
     }
     return;
@@ -2634,63 +2619,54 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
          "masked interleaved groups are not allowed.");
   assert((!MaskForGaps || !VF.isScalable()) &&
          "masking gaps for scalable vectors is not yet supported.");
-  for (unsigned Part = 0; Part < UF; Part++) {
-    // Collect the stored vector from each member.
-    SmallVector<Value *, 4> StoredVecs;
-    unsigned StoredIdx = 0;
-    for (unsigned i = 0; i < InterleaveFactor; i++) {
-      assert((Group->getMember(i) || MaskForGaps) &&
-             "Fail to get a member from an interleaved store group");
-      Instruction *Member = Group->getMember(i);
+  // Collect the stored vector from each member.
+  SmallVector<Value *, 4> StoredVecs;
+  unsigned StoredIdx = 0;
+  for (unsigned i = 0; i < InterleaveFactor; i++) {
+    assert((Group->getMember(i) || MaskForGaps) &&
+           "Fail to get a member from an interleaved store group");
+    Instruction *Member = Group->getMember(i);
 
-      // Skip the gaps in the group.
-      if (!Member) {
-        Value *Undef = PoisonValue::get(SubVT);
-        StoredVecs.push_back(Undef);
-        continue;
-      }
+    // Skip the gaps in the group.
+    if (!Member) {
+      Value *Undef = PoisonValue::get(SubVT);
+      StoredVecs.push_back(Undef);
+      continue;
+    }
 
-      Value *StoredVec = State.get(StoredValues[StoredIdx], Part);
-      ++StoredIdx;
+    Value *StoredVec = State.get(StoredValues[StoredIdx]);
+    ++StoredIdx;
 
-      if (Group->isReverse())
-        StoredVec = Builder.CreateVectorReverse(StoredVec, "reverse");
+    if (Group->isReverse())
+      StoredVec = Builder.CreateVectorReverse(StoredVec, "reverse");
 
-      // If this member has different type, cast it to a unified type.
+    // If this member has different type, cast it to a unified type.
 
-      if (StoredVec->getType() != SubVT)
-        StoredVec = createBitOrPointerCast(StoredVec, SubVT, DL);
+    if (StoredVec->getType() != SubVT)
+      StoredVec = createBitOrPointerCast(StoredVec, SubVT, DL);
 
-      StoredVecs.push_back(StoredVec);
-    }
+    StoredVecs.push_back(StoredVec);
+  }
 
-    // Interleave all the smaller vectors into one wider vector.
-    Value *IVec = interleaveVectors(Builder, StoredVecs, "interleaved.vec");
-    Instruction *NewStoreInstr;
-    if (BlockInMask || MaskForGaps) {
-      Value *GroupMask = CreateGroupMask(Part, MaskForGaps);
-      NewStoreInstr = Builder.CreateMaskedStore(IVec, AddrParts[Part],
-                                                Group->getAlign(), GroupMask);
-    } else
-      NewStoreInstr =
-          Builder.CreateAlignedStore(IVec, AddrParts[Part], Group->getAlign());
+  // Interleave all the smaller vectors into one wider vector.
+  Value *IVec = interleaveVectors(Builder, StoredVecs, "interleaved.vec");
+  Instruction *NewStoreInstr;
+  if (BlockInMask || MaskForGaps) {
+    Value *GroupMask = CreateGroupMask(MaskForGaps);
+    NewStoreInstr =
+        Builder.CreateMaskedStore(IVec, AddrV, Group->getAlign(), GroupMask);
+  } else
+    NewStoreInstr = Builder.CreateAlignedStore(IVec, AddrV, Group->getAlign());
 
-    Group->addMetadata(NewStoreInstr);
-  }
+  Group->addMetadata(NewStoreInstr);
 }
 
 void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,
                                                VPReplicateRecipe *RepRecipe,
-                                               const VPIteration &Instance,
+                                               const VPLane &Lane,
                                                VPTransformState &State) {
   assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
 
-  // llvm.experimental.noalias.scope.decl intrinsics must only be duplicated for
-  // the first lane and part.
-  if (isa<NoAliasScopeDeclInst>(Instr))
-    if (!Instance.isFirstIteration())
-      return;
-
   // Does this instruction return a value ?
   bool IsVoidRetTy = Instr->getType()->isVoidTy();
 
@@ -2713,18 +2689,18 @@ void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,
   // Replace the operands of the cloned instructions with their scalar
   // equivalents in the new loop.
   for (const auto &I : enumerate(RepRecipe->operands())) {
-    auto InputInstance = Instance;
+    auto InputLane = Lane;
     VPValue *Operand = I.value();
     if (vputils::isUniformAfterVectorization(Operand))
-      InputInstance.Lane = VPLane::getFirstLane();
-    Cloned->setOperand(I.index(), State.get(Operand, InputInstance));
+      InputLane = VPLane::getFirstLane();
+    Cloned->setOperand(I.index(), State.get(Operand, InputLane));
   }
   State.addNewMetadata(Cloned, Instr);
 
   // Place the cloned scalar in the new loop.
   State.Builder.Insert(Cloned);
 
-  State.set(RepRecipe, Cloned, Instance);
+  State.set(RepRecipe, Cloned, Lane);
 
   // If we just cloned a new assumption, add it the assumption cache.
   if (auto *II = dyn_cast<AssumeInst>(Cloned))
@@ -3251,7 +3227,7 @@ void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi,
       VPValue *StepVPV = Plan.getSCEVExpansion(II.getStep());
       assert(StepVPV && "step must have been expanded during VPlan execution");
       Value *Step = StepVPV->isLiveIn() ? StepVPV->getLiveInIRValue()
-                                        : State.get(StepVPV, {0, 0});
+                                        : State.get(StepVPV, VPLane(0));
       Value *Escape =
           emitTransformedIndex(B, CountMinusOne, II.getStartValue(), Step,
                                II.getKind(), II.getInductionBinOp());
@@ -3391,16 +3367,16 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
     fixNonInductionPHIs(Plan, State);
 
   // At this point every instruction in the original loop is widened to a
-  // vector form. Now we need to fix the recurrences in the loop. These PHI
-  // nodes are currently empty because we did not want to introduce cycles.
-  // This is the second stage of vectorizing recurrences. Note that fixing
-  // reduction phis are already modeled in VPlan.
-  // TODO: Also model fixing fixed-order recurrence phis in VPlan.
-  VPRegionBlock *VectorRegion = State.Plan->getVectorLoopRegion();
-  VPBasicBlock *HeaderVPBB = VectorRegion->getEntryBasicBlock();
-  for (VPRecipeBase &R : HeaderVPBB->phis()) {
-    if (auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&R))
-      fixFixedOrderRecurrence(FOR, State);
+  // vector form. Note that fixing reduction phis, as well as extracting the
+  // exit and resume values for fixed-order recurrences are already modeled in
+  // VPlan. All that remains to do here is creating a phi in the scalar
+  // pre-header for each fixed-order recurrence resume value.
+  // TODO: Also model creating phis in the scalar pre-header in VPlan.
+  for (const auto &[_, LO] : to_vector(Plan.getLiveOuts())) {
+    if (!Legal->isFixedOrderRecurrence(LO->getPhi()))
+      continue;
+    fixFixedOrderRecurrence(LO, State);
+    Plan.removeLiveOut(LO->getPhi());
   }
 
   // Forget the original basic block.
@@ -3416,6 +3392,7 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
     for (PHINode &PN : Exit->phis())
       PSE.getSE()->forgetLcssaPhiWithNewPredecessor(OrigLoop, &PN);
 
+  VPRegionBlock *VectorRegion = State.Plan->getVectorLoopRegion();
   VPBasicBlock *LatchVPBB = VectorRegion->getExitingBasicBlock();
   Loop *VectorLoop = LI->getLoopFor(State.CFG.VPBB2IRBB[LatchVPBB]);
   if (Cost->requiresScalarEpilogue(VF.isVector())) {
@@ -3469,78 +3446,18 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                                VF.getKnownMinValue() * UF);
 }
 
-void InnerLoopVectorizer::fixFixedOrderRecurrence(
-    VPFirstOrderRecurrencePHIRecipe *PhiR, VPTransformState &State) {
-  // This is the second phase of vectorizing first-order recurrences. An
-  // overview of the transformation is described below. Suppose we have the
-  // following loop.
-  //
-  //   for (int i = 0; i < n; ++i)
-  //     b[i] = a[i] - a[i - 1];
-  //
-  // There is a first-order recurrence on "a". For this loop, the shorthand
-  // scalar IR looks like:
-  //
-  //   scalar.ph:
-  //     s_init = a[-1]
-  //     br scalar.body
-  //
-  //   scalar.body:
-  //     i = phi [0, scalar.ph], [i+1, scalar.body]
-  //     s1 = phi [s_init, scalar.ph], [s2, scalar.body]
-  //     s2 = a[i]
-  //     b[i] = s2 - s1
-  //     br cond, scalar.body, ...
-  //
-  // In this example, s1 is a recurrence because it's value depends on the
-  // previous iteration. In the first phase of vectorization, we created a
-  // vector phi v1 for s1. We now complete the vectorization and produce the
-  // shorthand vector IR shown below (for VF = 4, UF = 1).
-  //
-  //   vector.ph:
-  //     v_init = vector(..., ..., ..., a[-1])
-  //     br vector.body
-  //
-  //   vector.body
-  //     i = phi [0, vector.ph], [i+4, vector.body]
-  //     v1 = phi [v_init, vector.ph], [v2, vector.body]
-  //     v2 = a[i, i+1, i+2, i+3];
-  //     v3 = vector(v1(3), v2(0, 1, 2))
-  //     b[i, i+1, i+2, i+3] = v2 - v3
-  //     br cond, vector.body, middle.block
-  //
-  //   middle.block:
-  //     x = v2(3)
-  //     br scalar.ph
-  //
-  //   scalar.ph:
-  //     s_init = phi [x, middle.block], [a[-1], otherwise]
-  //     br scalar.body
-  //
-  // After execution completes the vector loop, we extr...
[truncated]

github-actions bot commented Jun 4, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 52d29eb2874580f0fe96e5cbb96faffbbdc432a7 e536698027f91df07eb4df9b933f451c01189ae3 -- llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h llvm/lib/Transforms/Vectorize/LoopVectorize.cpp llvm/lib/Transforms/Vectorize/VPlan.cpp llvm/lib/Transforms/Vectorize/VPlan.h llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp llvm/lib/Transforms/Vectorize/VPlanTransforms.h llvm/lib/Transforms/Vectorize/VPlanValue.h
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 0d5318f4d2..8b69c7e7fb 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -9171,24 +9171,24 @@ void VPWidenPointerInductionRecipe::execute(VPTransformState &State) {
   // Create UF many actual address geps that use the pointer
   // phi as base and a vectorized version of the step value
   // (<step*0, ..., step*N>) as offset.
-    Type *VecPhiType = VectorType::get(PhiType, State.VF);
-    Value *StartOffsetScalar =
-        State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, Part));
-    Value *StartOffset =
-        State.Builder.CreateVectorSplat(State.VF, StartOffsetScalar);
-    // Create a vector of consecutive numbers from zero to VF.
-    StartOffset = State.Builder.CreateAdd(
-        StartOffset, State.Builder.CreateStepVector(VecPhiType));
-
-    assert(ScalarStepValue == State.get(getOperand(1), VPLane(0)) &&
-           "scalar step must be the same across all parts");
-    Value *GEP = State.Builder.CreateGEP(
-        State.Builder.getInt8Ty(), NewPointerPhi,
-        State.Builder.CreateMul(
-            StartOffset,
-            State.Builder.CreateVectorSplat(State.VF, ScalarStepValue),
-            "vector.gep"));
-    State.set(this, GEP, 0);
+  Type *VecPhiType = VectorType::get(PhiType, State.VF);
+  Value *StartOffsetScalar =
+      State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, Part));
+  Value *StartOffset =
+      State.Builder.CreateVectorSplat(State.VF, StartOffsetScalar);
+  // Create a vector of consecutive numbers from zero to VF.
+  StartOffset = State.Builder.CreateAdd(
+      StartOffset, State.Builder.CreateStepVector(VecPhiType));
+
+  assert(ScalarStepValue == State.get(getOperand(1), VPLane(0)) &&
+         "scalar step must be the same across all parts");
+  Value *GEP = State.Builder.CreateGEP(
+      State.Builder.getInt8Ty(), NewPointerPhi,
+      State.Builder.CreateMul(
+          StartOffset,
+          State.Builder.CreateVectorSplat(State.VF, ScalarStepValue),
+          "vector.gep"));
+  State.set(this, GEP, 0);
 }
 
 void VPDerivedIVRecipe::execute(VPTransformState &State) {
@@ -9268,14 +9268,14 @@ void VPWidenLoadRecipe::execute(VPTransformState &State) {
 
   auto &Builder = State.Builder;
   State.setDebugLocFrom(getDebugLoc());
-    Value *NewLI;
-    Value *Mask = nullptr;
-    if (auto *VPMask = getMask()) {
-      // Mask reversal is only needed for non-all-one (null) masks, as reverse
-      // of a null all-one mask is a null mask.
-      Mask = State.get(VPMask);
-      if (isReverse())
-        Mask = Builder.CreateVectorReverse(Mask, "reverse");
+  Value *NewLI;
+  Value *Mask = nullptr;
+  if (auto *VPMask = getMask()) {
+    // Mask reversal is only needed for non-all-one (null) masks, as reverse
+    // of a null all-one mask is a null mask.
+    Mask = State.get(VPMask);
+    if (isReverse())
+      Mask = Builder.CreateVectorReverse(Mask, "reverse");
     }
 
     Value *Addr = State.get(getAddr(), /*IsScalar*/ !CreateGather);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0afb2c5884..c20f4c1fc7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -606,14 +606,14 @@ void VPInstruction::execute(VPTransformState &State) {
       (vputils::onlyFirstLaneUsed(this) || isVectorToScalar());
   bool GeneratesPerAllLanes = doesGeneratePerAllLanes();
   bool OnlyFirstPartUsed = vputils::onlyFirstPartUsed(this);
-    if (GeneratesPerAllLanes) {
-      for (unsigned Lane = 0, NumLanes = State.VF.getKnownMinValue();
-           Lane != NumLanes; ++Lane) {
-        Value *GeneratedValue = generatePerLane(State, Lane);
-        assert(GeneratedValue && "generatePerLane must produce a value");
-        State.set(this, GeneratedValue, VPLane(Lane));
-      }
-      return;
+  if (GeneratesPerAllLanes) {
+    for (unsigned Lane = 0, NumLanes = State.VF.getKnownMinValue();
+         Lane != NumLanes; ++Lane) {
+      Value *GeneratedValue = generatePerLane(State, Lane);
+      assert(GeneratedValue && "generatePerLane must produce a value");
+      State.set(this, GeneratedValue, VPLane(Lane));
+    }
+    return;
     }
 
     Value *GeneratedValue = generate(State);
@@ -1386,9 +1386,9 @@ void VPWidenGEPRecipe::execute(VPTransformState &State) {
     auto *NewGEP =
         State.Builder.CreateGEP(GEP->getSourceElementType(), Ops[0],
                                 ArrayRef(Ops).drop_front(), "", isInBounds());
-      Value *EntryPart = State.Builder.CreateVectorSplat(State.VF, NewGEP);
-      State.set(this, EntryPart, 0);
-      State.addMetadata(EntryPart, GEP);
+    Value *EntryPart = State.Builder.CreateVectorSplat(State.VF, NewGEP);
+    State.set(this, EntryPart, 0);
+    State.addMetadata(EntryPart, GEP);
   } else {
     // If the GEP has at least one loop-varying operand, we are sure to
     // produce a vector of pointers. But if we are only unrolling, we want
@@ -1397,8 +1397,8 @@ void VPWidenGEPRecipe::execute(VPTransformState &State) {
     // (otherwise). Note that for the unroll-only case, we still maintain
     // values in the vector mapping with initVector, as we do for other
     // instructions.
-      // The pointer operand of the new GEP. If it's loop-invariant, we
-      // won't broadcast it.
+    // The pointer operand of the new GEP. If it's loop-invariant, we
+    // won't broadcast it.
     auto *Ptr = isPointerLoopInvariant() ? State.get(getOperand(0), VPLane(0))
                                          : State.get(getOperand(0));
 

@ayalz (Collaborator) commented Jun 5, 2024

Great to see this materialize, and the amount of clean-up it brings!

Can the patch be split into two main parts:

  1. introducing the VPlanTransform::interleave() pass, reusing existing code-gen with UF reset to 1
  2. retiring support for UF > 1 in code-gen, across the execute() implementations and VPTransformState,
    plus separating out the independent clang-format fixes.

Rebasing on top of #93396 will surely reduce the overall size, but enough would remain for such a breakdown to be useful.

fhahn added a commit to fhahn/llvm-project that referenced this pull request Jun 17, 2024
This patch implements explicit interleaving as a VPlan transform. In
follow-up patches this will allow simplifying VPTransformState
(no need to store unrolled parts) as well as recipe execution (no
need to generate code for multiple parts in each recipe). It also
allows for more general optimizations (e.g. avoiding code generation
for recipes that are uniform across parts).

In the initial implementation, a number of recipes still take the
unrolled part as an additional, optional argument if their execution
depends on the unrolled part.

The computation of start/step values for scalable inductions changed
slightly. Previously the step was computed as a scalar and then
splatted; now vscale is splatted and multiplied by the step in a
vector mul.

This has been split off from llvm#94339,
which also includes changes to simplify VPTransformState and recipes'
::execute.

The current version mostly leaves the existing ::execute implementations
untouched and instead sets VPTransformState::UF to 1.
@fhahn (Contributor, Author) commented Jun 17, 2024

Great to see this materialize, and the amount of clean-up it brings!

Can the patch be split into two main parts:

  1. introducing the VPlanTransform::interleave() pass, reusing existing code-gen with UF reset to 1
  2. retiring support for UF > 1 in code-gen, across the execute() implementations and VPTransformState,
    plus separating out the independent clang-format fixes.

Split off the ::interleave part separately, as suggested, into #95842, which retains the existing structure in the recipes (iterating over parts) and doesn't simplify VPTransformState yet.

fhahn added a commit to fhahn/llvm-project that referenced this pull request Jun 22, 2024
fhahn added a commit to fhahn/llvm-project that referenced this pull request Aug 12, 2024
fhahn added a commit to fhahn/llvm-project that referenced this pull request Aug 14, 2024
fhahn added a commit that referenced this pull request Sep 21, 2024
This patch implements explicit unrolling by UF as a VPlan transform. In
follow-up patches this will allow simplifying VPTransformState (no need
to store unrolled parts) as well as recipe execution (no need to
generate code for multiple parts in each recipe). It also allows for
more general optimizations (e.g. avoiding code generation for recipes
that are uniform across parts).

It also unifies the logic dealing with unrolled parts in a single place,
rather than spreading it out across multiple places (e.g. previously the
VPlan post-processing for header-phi recipes).

In the initial implementation, a number of recipes still take the
unrolled part as an additional, optional argument if their execution
depends on the unrolled part.

The computation of start/step values for scalable inductions changed
slightly. Previously the step was computed as a scalar and then
splatted; now vscale is splatted and multiplied by the step in a
vector mul.

This has been split off from #94339,
which also includes changes to simplify VPTransformState and recipes'
::execute.

The current version mostly leaves the existing ::execute implementations
untouched and instead sets VPTransformState::UF to 1.

A follow-up patch will clean up all references to VPTransformState::UF.

Another follow-up patch will simplify VPTransformState to only store a
single vector value per VPValue.

PR: #95842
@fhahn (Contributor, Author) commented Sep 24, 2024

The simplified initial version has landed as 8ec4067 (#95842), together with the follow-up simplifications 06c3a7d, 57f5d8f, and f76dae1.

fhahn closed this Sep 24, 2024
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024