
[APFloat] Add APFloat support for FP6 data types #94735

Merged: 1 commit merged into llvm:main on Jun 11, 2024

Conversation

durga4github (Contributor)

This patch adds APFloat type support for two FP6 data types, E2M3 and E3M2.
The definitions for the two formats are detailed in section 5.3.2 of the
OCP specification, which can be accessed here:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
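
As a quick illustration of the new API surface (not part of the patch), here is a minimal usage sketch. The accessor names and the 28/7.5 maxima come from the diff and unit tests below; the program itself is only a hypothetical example:

```cpp
#include "llvm/ADT/APFloat.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  // Build an FP6 E3M2 value from a decimal string and inspect its 6-bit encoding.
  APFloat V(APFloat::Float6E3M2FN(), "1.5");
  outs() << "raw bits: " << V.bitcastToAPInt().getZExtValue() << "\n";

  // The largest finite values follow section 5.3.2 of the OCP spec; neither
  // format encodes Inf or NaN.
  outs() << "E3M2 max: "
         << APFloat::getLargest(APFloat::Float6E3M2FN()).convertToDouble()
         << "\n"; // 28
  outs() << "E2M3 max: "
         << APFloat::getLargest(APFloat::Float6E2M3FN()).convertToDouble()
         << "\n"; // 7.5
  return 0;
}
```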

llvmbot added the clang (Clang issues not falling into any other category), clang:frontend (Language frontend issues, e.g. anything involving "Sema"), llvm:support, and llvm:adt labels on Jun 7, 2024
@llvmbot (Collaborator) commented Jun 7, 2024

@llvm/pr-subscribers-llvm-adt

@llvm/pr-subscribers-llvm-support

Author: Durgadoss R (durga4github)

Changes

This patch adds APFloat type support for two FP6 data types, E2M3 and E3M2.
The definitions for the two formats are detailed in section 5.3.2 of the
OCP specification, which can be accessed here:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf


Patch is 34.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/94735.diff

4 Files Affected:

  • (modified) clang/lib/AST/MicrosoftMangle.cpp (+2)
  • (modified) llvm/include/llvm/ADT/APFloat.h (+14)
  • (modified) llvm/lib/Support/APFloat.cpp (+77-12)
  • (modified) llvm/unittests/ADT/APFloatTest.cpp (+474-22)
diff --git a/clang/lib/AST/MicrosoftMangle.cpp b/clang/lib/AST/MicrosoftMangle.cpp
index 36d611750ca48..72c79dab6bdcc 100644
--- a/clang/lib/AST/MicrosoftMangle.cpp
+++ b/clang/lib/AST/MicrosoftMangle.cpp
@@ -899,6 +899,8 @@ void MicrosoftCXXNameMangler::mangleFloat(llvm::APFloat Number) {
   case APFloat::S_Float8E4M3FNUZ:
   case APFloat::S_Float8E4M3B11FNUZ:
   case APFloat::S_FloatTF32:
+  case APFloat::S_Float6E3M2FN:
+  case APFloat::S_Float6E2M3FN:
     llvm_unreachable("Tried to mangle unexpected APFloat semantics");
   }
 
diff --git a/llvm/include/llvm/ADT/APFloat.h b/llvm/include/llvm/ADT/APFloat.h
index 44a301ecc9928..149b7a165c9d4 100644
--- a/llvm/include/llvm/ADT/APFloat.h
+++ b/llvm/include/llvm/ADT/APFloat.h
@@ -189,6 +189,14 @@ struct APFloatBase {
     // improved range compared to half (16-bit) formats, at (potentially)
     // greater throughput than single precision (32-bit) formats.
     S_FloatTF32,
+    // 6-bit floating point number with bit layout S1E3M2. Unlike IEEE-754
+    // types, there are no infinity or NaN values. The format is detailed in
+    // https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
+    S_Float6E3M2FN,
+    // 6-bit floating point number with bit layout S1E2M3. Unlike IEEE-754
+    // types, there are no infinity or NaN values. The format is detailed in
+    // https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
+    S_Float6E2M3FN,
 
     S_x87DoubleExtended,
     S_MaxSemantics = S_x87DoubleExtended,
@@ -209,6 +217,8 @@ struct APFloatBase {
   static const fltSemantics &Float8E4M3FNUZ() LLVM_READNONE;
   static const fltSemantics &Float8E4M3B11FNUZ() LLVM_READNONE;
   static const fltSemantics &FloatTF32() LLVM_READNONE;
+  static const fltSemantics &Float6E3M2FN() LLVM_READNONE;
+  static const fltSemantics &Float6E2M3FN() LLVM_READNONE;
   static const fltSemantics &x87DoubleExtended() LLVM_READNONE;
 
   /// A Pseudo fltsemantic used to construct APFloats that cannot conflict with
@@ -627,6 +637,8 @@ class IEEEFloat final : public APFloatBase {
   APInt convertFloat8E4M3FNUZAPFloatToAPInt() const;
   APInt convertFloat8E4M3B11FNUZAPFloatToAPInt() const;
   APInt convertFloatTF32APFloatToAPInt() const;
+  APInt convertFloat6E3M2FNAPFloatToAPInt() const;
+  APInt convertFloat6E2M3FNAPFloatToAPInt() const;
   void initFromAPInt(const fltSemantics *Sem, const APInt &api);
   template <const fltSemantics &S> void initFromIEEEAPInt(const APInt &api);
   void initFromHalfAPInt(const APInt &api);
@@ -642,6 +654,8 @@ class IEEEFloat final : public APFloatBase {
   void initFromFloat8E4M3FNUZAPInt(const APInt &api);
   void initFromFloat8E4M3B11FNUZAPInt(const APInt &api);
   void initFromFloatTF32APInt(const APInt &api);
+  void initFromFloat6E3M2FNAPInt(const APInt &api);
+  void initFromFloat6E2M3FNAPInt(const APInt &api);
 
   void assign(const IEEEFloat &);
   void copySignificand(const IEEEFloat &);
diff --git a/llvm/lib/Support/APFloat.cpp b/llvm/lib/Support/APFloat.cpp
index 283fcc153b33a..b8ca56d96efe4 100644
--- a/llvm/lib/Support/APFloat.cpp
+++ b/llvm/lib/Support/APFloat.cpp
@@ -68,6 +68,10 @@ enum class fltNonfiniteBehavior {
   // `fltNanEncoding` enum. We treat all NaNs as quiet, as the available
   // encodings do not distinguish between signalling and quiet NaN.
   NanOnly,
+
+  // This behavior is present in Float6E3M2FN and Float6E2M3FN types.
+  // There is no representation for Inf or NaN.
+  NoNanInf,
 };
 
 // How NaN values are represented. This is curently only used in combination
@@ -139,6 +143,10 @@ static constexpr fltSemantics semFloat8E4M3FNUZ = {
 static constexpr fltSemantics semFloat8E4M3B11FNUZ = {
     4, -10, 4, 8, fltNonfiniteBehavior::NanOnly, fltNanEncoding::NegativeZero};
 static constexpr fltSemantics semFloatTF32 = {127, -126, 11, 19};
+static constexpr fltSemantics semFloat6E3M2FN = {
+    4, -2, 3, 6, fltNonfiniteBehavior::NoNanInf};
+static constexpr fltSemantics semFloat6E2M3FN = {
+    2, 0, 4, 6, fltNonfiniteBehavior::NoNanInf};
 static constexpr fltSemantics semX87DoubleExtended = {16383, -16382, 64, 80};
 static constexpr fltSemantics semBogus = {0, 0, 0, 0};
 
@@ -206,6 +214,10 @@ const llvm::fltSemantics &APFloatBase::EnumToSemantics(Semantics S) {
     return Float8E4M3B11FNUZ();
   case S_FloatTF32:
     return FloatTF32();
+  case S_Float6E3M2FN:
+    return Float6E3M2FN();
+  case S_Float6E2M3FN:
+    return Float6E2M3FN();
   case S_x87DoubleExtended:
     return x87DoubleExtended();
   }
@@ -238,6 +250,10 @@ APFloatBase::SemanticsToEnum(const llvm::fltSemantics &Sem) {
     return S_Float8E4M3B11FNUZ;
   else if (&Sem == &llvm::APFloat::FloatTF32())
     return S_FloatTF32;
+  else if (&Sem == &llvm::APFloat::Float6E3M2FN())
+    return S_Float6E3M2FN;
+  else if (&Sem == &llvm::APFloat::Float6E2M3FN())
+    return S_Float6E2M3FN;
   else if (&Sem == &llvm::APFloat::x87DoubleExtended())
     return S_x87DoubleExtended;
   else
@@ -260,6 +276,8 @@ const fltSemantics &APFloatBase::Float8E4M3B11FNUZ() {
   return semFloat8E4M3B11FNUZ;
 }
 const fltSemantics &APFloatBase::FloatTF32() { return semFloatTF32; }
+const fltSemantics &APFloatBase::Float6E3M2FN() { return semFloat6E3M2FN; }
+const fltSemantics &APFloatBase::Float6E2M3FN() { return semFloat6E2M3FN; }
 const fltSemantics &APFloatBase::x87DoubleExtended() {
   return semX87DoubleExtended;
 }
@@ -878,6 +896,10 @@ void IEEEFloat::copySignificand(const IEEEFloat &rhs) {
    for the significand.  If double or longer, this is a signalling NaN,
    which may not be ideal.  If float, this is QNaN(0).  */
 void IEEEFloat::makeNaN(bool SNaN, bool Negative, const APInt *fill) {
+  if (semantics->nonFiniteBehavior == fltNonfiniteBehavior::NoNanInf) {
+    assert(false && "This floating point format does not support NaN\n");
+    return;
+  }
   category = fcNaN;
   sign = Negative;
   exponent = exponentNaN();
@@ -1499,16 +1521,18 @@ static void tcSetLeastSignificantBits(APInt::WordType *dst, unsigned parts,
 /* Handle overflow.  Sign is preserved.  We either become infinity or
    the largest finite number.  */
 IEEEFloat::opStatus IEEEFloat::handleOverflow(roundingMode rounding_mode) {
-  /* Infinity?  */
-  if (rounding_mode == rmNearestTiesToEven ||
-      rounding_mode == rmNearestTiesToAway ||
-      (rounding_mode == rmTowardPositive && !sign) ||
-      (rounding_mode == rmTowardNegative && sign)) {
-    if (semantics->nonFiniteBehavior == fltNonfiniteBehavior::NanOnly)
-      makeNaN(false, sign);
-    else
-      category = fcInfinity;
-    return (opStatus) (opOverflow | opInexact);
+  if (semantics->nonFiniteBehavior != fltNonfiniteBehavior::NoNanInf) {
+    /* Infinity?  */
+    if (rounding_mode == rmNearestTiesToEven ||
+        rounding_mode == rmNearestTiesToAway ||
+        (rounding_mode == rmTowardPositive && !sign) ||
+        (rounding_mode == rmTowardNegative && sign)) {
+      if (semantics->nonFiniteBehavior == fltNonfiniteBehavior::NanOnly)
+        makeNaN(false, sign);
+      else
+        category = fcInfinity;
+      return (opStatus) (opOverflow | opInexact);
+    }
   }
 
   /* Otherwise we become the largest finite number.  */
@@ -3518,13 +3542,17 @@ APInt IEEEFloat::convertIEEEFloatToAPInt() const {
     myexponent = ::exponentZero(S) + bias;
     mysignificand.fill(0);
   } else if (category == fcInfinity) {
-    if (S.nonFiniteBehavior == fltNonfiniteBehavior::NanOnly) {
+    if (S.nonFiniteBehavior == fltNonfiniteBehavior::NanOnly ||
+        S.nonFiniteBehavior == fltNonfiniteBehavior::NoNanInf) {
       llvm_unreachable("semantics don't support inf!");
     }
     myexponent = ::exponentInf(S) + bias;
     mysignificand.fill(0);
   } else {
     assert(category == fcNaN && "Unknown category!");
+    if (S.nonFiniteBehavior == fltNonfiniteBehavior::NoNanInf) {
+      llvm_unreachable("semantics don't support NaN!");
+    }
     myexponent = ::exponentNaN(S) + bias;
     std::copy_n(significandParts(), mysignificand.size(),
                 mysignificand.begin());
@@ -3605,6 +3633,16 @@ APInt IEEEFloat::convertFloatTF32APFloatToAPInt() const {
   return convertIEEEFloatToAPInt<semFloatTF32>();
 }
 
+APInt IEEEFloat::convertFloat6E3M2FNAPFloatToAPInt() const {
+  assert(partCount() == 1);
+  return convertIEEEFloatToAPInt<semFloat6E3M2FN>();
+}
+
+APInt IEEEFloat::convertFloat6E2M3FNAPFloatToAPInt() const {
+  assert(partCount() == 1);
+  return convertIEEEFloatToAPInt<semFloat6E2M3FN>();
+}
+
 // This function creates an APInt that is just a bit map of the floating
 // point constant as it would appear in memory.  It is not a conversion,
 // and treating the result as a normal integer is unlikely to be useful.
@@ -3646,6 +3684,12 @@ APInt IEEEFloat::bitcastToAPInt() const {
   if (semantics == (const llvm::fltSemantics *)&semFloatTF32)
     return convertFloatTF32APFloatToAPInt();
 
+  if (semantics == (const llvm::fltSemantics *)&semFloat6E3M2FN)
+    return convertFloat6E3M2FNAPFloatToAPInt();
+
+  if (semantics == (const llvm::fltSemantics *)&semFloat6E2M3FN)
+    return convertFloat6E2M3FNAPFloatToAPInt();
+
   assert(semantics == (const llvm::fltSemantics*)&semX87DoubleExtended &&
          "unknown format!");
   return convertF80LongDoubleAPFloatToAPInt();
@@ -3862,6 +3906,14 @@ void IEEEFloat::initFromFloatTF32APInt(const APInt &api) {
   initFromIEEEAPInt<semFloatTF32>(api);
 }
 
+void IEEEFloat::initFromFloat6E3M2FNAPInt(const APInt &api) {
+  initFromIEEEAPInt<semFloat6E3M2FN>(api);
+}
+
+void IEEEFloat::initFromFloat6E2M3FNAPInt(const APInt &api) {
+  initFromIEEEAPInt<semFloat6E2M3FN>(api);
+}
+
 /// Treat api as containing the bits of a floating point number.
 void IEEEFloat::initFromAPInt(const fltSemantics *Sem, const APInt &api) {
   assert(api.getBitWidth() == Sem->sizeInBits);
@@ -3891,6 +3943,10 @@ void IEEEFloat::initFromAPInt(const fltSemantics *Sem, const APInt &api) {
     return initFromFloat8E4M3B11FNUZAPInt(api);
   if (Sem == &semFloatTF32)
     return initFromFloatTF32APInt(api);
+  if (Sem == &semFloat6E3M2FN)
+    return initFromFloat6E3M2FNAPInt(api);
+  if (Sem == &semFloat6E2M3FN)
+    return initFromFloat6E2M3FNAPInt(api);
 
   llvm_unreachable(nullptr);
 }
@@ -4328,7 +4384,8 @@ int IEEEFloat::getExactLog2Abs() const {
 bool IEEEFloat::isSignaling() const {
   if (!isNaN())
     return false;
-  if (semantics->nonFiniteBehavior == fltNonfiniteBehavior::NanOnly)
+  if (semantics->nonFiniteBehavior == fltNonfiniteBehavior::NanOnly ||
+      semantics->nonFiniteBehavior == fltNonfiniteBehavior::NoNanInf)
     return false;
 
   // IEEE-754R 2008 6.2.1: A signaling NaN bit string should be encoded with the
@@ -4387,6 +4444,10 @@ IEEEFloat::opStatus IEEEFloat::next(bool nextDown) {
         // nextUp(getLargest()) == NAN
         makeNaN();
         break;
+      } else if (semantics->nonFiniteBehavior ==
+                 fltNonfiniteBehavior::NoNanInf) {
+        // nextUp(getLargest()) == getLargest()
+        break;
       } else {
         // nextUp(getLargest()) == INFINITY
         APInt::tcSet(significandParts(), 0, partCount());
@@ -4477,6 +4538,10 @@ APFloatBase::ExponentType IEEEFloat::exponentZero() const {
 }
 
 void IEEEFloat::makeInf(bool Negative) {
+  if (semantics->nonFiniteBehavior == fltNonfiniteBehavior::NoNanInf) {
+    assert(false && "This floating point format does not support Inf\n");
+    return;
+  }
   if (semantics->nonFiniteBehavior == fltNonfiniteBehavior::NanOnly) {
     // There is no Inf, so make NaN instead.
     makeNaN(false, Negative);
diff --git a/llvm/unittests/ADT/APFloatTest.cpp b/llvm/unittests/ADT/APFloatTest.cpp
index 6e4dda8351a1b..d80442b0661ee 100644
--- a/llvm/unittests/ADT/APFloatTest.cpp
+++ b/llvm/unittests/ADT/APFloatTest.cpp
@@ -47,6 +47,10 @@ static std::string convertToString(double d, unsigned Prec, unsigned Pad,
   return std::string(Buffer.data(), Buffer.size());
 }
 
+static bool hasNanOrInf(APFloat::Semantics S) {
+  return (S != APFloat::S_Float6E3M2FN) && (S != APFloat::S_Float6E2M3FN);
+}
+
 namespace {
 
 TEST(APFloatTest, isSignaling) {
@@ -723,11 +727,14 @@ TEST(APFloatTest, IsSmallestNormalized) {
     EXPECT_FALSE(APFloat::getZero(Semantics, false).isSmallestNormalized());
     EXPECT_FALSE(APFloat::getZero(Semantics, true).isSmallestNormalized());
 
-    EXPECT_FALSE(APFloat::getInf(Semantics, false).isSmallestNormalized());
-    EXPECT_FALSE(APFloat::getInf(Semantics, true).isSmallestNormalized());
+    if (hasNanOrInf(static_cast<APFloat::Semantics>(I))) {
+      EXPECT_FALSE(APFloat::getInf(Semantics, false).isSmallestNormalized());
+      EXPECT_FALSE(APFloat::getInf(Semantics, true).isSmallestNormalized());
+
+      EXPECT_FALSE(APFloat::getQNaN(Semantics).isSmallestNormalized());
+      EXPECT_FALSE(APFloat::getSNaN(Semantics).isSmallestNormalized());
+    }
 
-    EXPECT_FALSE(APFloat::getQNaN(Semantics).isSmallestNormalized());
-    EXPECT_FALSE(APFloat::getSNaN(Semantics).isSmallestNormalized());
 
     EXPECT_FALSE(APFloat::getLargest(Semantics).isSmallestNormalized());
     EXPECT_FALSE(APFloat::getLargest(Semantics, true).isSmallestNormalized());
@@ -1823,6 +1830,10 @@ TEST(APFloatTest, getLargest) {
       30, APFloat::getLargest(APFloat::Float8E4M3B11FNUZ()).convertToDouble());
   EXPECT_EQ(3.40116213421e+38f,
             APFloat::getLargest(APFloat::FloatTF32()).convertToFloat());
+  EXPECT_EQ(28,
+            APFloat::getLargest(APFloat::Float6E3M2FN()).convertToDouble());
+  EXPECT_EQ(7.5,
+            APFloat::getLargest(APFloat::Float6E2M3FN()).convertToDouble());
 }
 
 TEST(APFloatTest, getSmallest) {
@@ -1881,6 +1892,21 @@ TEST(APFloatTest, getSmallest) {
   EXPECT_TRUE(test.isFiniteNonZero());
   EXPECT_TRUE(test.isDenormal());
   EXPECT_TRUE(test.bitwiseIsEqual(expected));
+
+  test = APFloat::getSmallest(APFloat::Float6E3M2FN(), false);
+  expected = APFloat(APFloat::Float6E3M2FN(), "0x0.1p0");
+  EXPECT_FALSE(test.isNegative());
+  EXPECT_TRUE(test.isFiniteNonZero());
+  EXPECT_TRUE(test.isDenormal());
+  EXPECT_TRUE(test.bitwiseIsEqual(expected));
+
+  test = APFloat::getSmallest(APFloat::Float6E2M3FN(), false);
+  expected = APFloat(APFloat::Float6E2M3FN(), "0x0.2p0");
+  EXPECT_FALSE(test.isNegative());
+  EXPECT_TRUE(test.isFiniteNonZero());
+  EXPECT_TRUE(test.isDenormal());
+  EXPECT_TRUE(test.bitwiseIsEqual(expected));
+
 }
 
 TEST(APFloatTest, getSmallestNormalized) {
@@ -1963,6 +1989,22 @@ TEST(APFloatTest, getSmallestNormalized) {
   EXPECT_FALSE(test.isDenormal());
   EXPECT_TRUE(test.bitwiseIsEqual(expected));
   EXPECT_TRUE(test.isSmallestNormalized());
+  test = APFloat::getSmallestNormalized(APFloat::Float6E3M2FN(), false);
+  expected = APFloat(APFloat::Float6E3M2FN(), "0x1p-2");
+  EXPECT_FALSE(test.isNegative());
+  EXPECT_TRUE(test.isFiniteNonZero());
+  EXPECT_FALSE(test.isDenormal());
+  EXPECT_TRUE(test.bitwiseIsEqual(expected));
+  EXPECT_TRUE(test.isSmallestNormalized());
+
+  test = APFloat::getSmallestNormalized(APFloat::Float6E2M3FN(), false);
+  expected = APFloat(APFloat::Float6E2M3FN(), "0x1p0");
+  EXPECT_FALSE(test.isNegative());
+  EXPECT_TRUE(test.isFiniteNonZero());
+  EXPECT_FALSE(test.isDenormal());
+  EXPECT_TRUE(test.bitwiseIsEqual(expected));
+  EXPECT_TRUE(test.isSmallestNormalized());
+
 }
 
 TEST(APFloatTest, getZero) {
@@ -1996,7 +2038,11 @@ TEST(APFloatTest, getZero) {
       {&APFloat::Float8E4M3B11FNUZ(), false, false, {0, 0}, 1},
       {&APFloat::Float8E4M3B11FNUZ(), true, false, {0, 0}, 1},
       {&APFloat::FloatTF32(), false, true, {0, 0}, 1},
-      {&APFloat::FloatTF32(), true, true, {0x40000ULL, 0}, 1}};
+      {&APFloat::FloatTF32(), true, true, {0x40000ULL, 0}, 1},
+      {&APFloat::Float6E3M2FN(), false, true, {0, 0}, 1},
+      {&APFloat::Float6E3M2FN(), true, true, {0x20ULL, 0}, 1},
+      {&APFloat::Float6E2M3FN(), false, true, {0, 0}, 1},
+      {&APFloat::Float6E2M3FN(), true, true, {0x20ULL, 0}, 1}};
   const unsigned NumGetZeroTests = std::size(GetZeroTest);
   for (unsigned i = 0; i < NumGetZeroTests; ++i) {
     APFloat test = APFloat::getZero(*GetZeroTest[i].semantics,
@@ -5161,6 +5207,90 @@ TEST(APFloatTest, Float8ExhaustivePair) {
   }
 }
 
+TEST(APFloatTest, Float6ExhaustivePair) {
+  // Test each pair of 6-bit floats with non-standard semantics
+  for (APFloat::Semantics Sem :
+       {APFloat::S_Float6E3M2FN, APFloat::S_Float6E2M3FN}) {
+    const llvm::fltSemantics &S = APFloat::EnumToSemantics(Sem);
+    for (int i = 1; i < 64; i++) {
+      for (int j = 1; j < 64; j++) {
+        SCOPED_TRACE("sem=" + std::to_string(Sem) + ",i=" + std::to_string(i) +
+                     ",j=" + std::to_string(j));
+        APFloat x(S, APInt(6, i));
+        APFloat y(S, APInt(6, j));
+
+        bool losesInfo;
+        APFloat x16 = x;
+        x16.convert(APFloat::IEEEhalf(), APFloat::rmNearestTiesToEven,
+                    &losesInfo);
+        EXPECT_FALSE(losesInfo);
+        APFloat y16 = y;
+        y16.convert(APFloat::IEEEhalf(), APFloat::rmNearestTiesToEven,
+                    &losesInfo);
+        EXPECT_FALSE(losesInfo);
+
+        // Add
+        APFloat z = x;
+        z.add(y, APFloat::rmNearestTiesToEven);
+        APFloat z16 = x16;
+        z16.add(y16, APFloat::rmNearestTiesToEven);
+        z16.convert(S, APFloat::rmNearestTiesToEven, &losesInfo);
+        EXPECT_TRUE(z.bitwiseIsEqual(z16))
+            << "sem=" << Sem << ", i=" << i << ", j=" << j;
+
+        // Subtract
+        z = x;
+        z.subtract(y, APFloat::rmNearestTiesToEven);
+        z16 = x16;
+        z16.subtract(y16, APFloat::rmNearestTiesToEven);
+        z16.convert(S, APFloat::rmNearestTiesToEven, &losesInfo);
+        EXPECT_TRUE(z.bitwiseIsEqual(z16))
+            << "sem=" << Sem << ", i=" << i << ", j=" << j;
+
+        // Multiply
+        z = x;
+        z.multiply(y, APFloat::rmNearestTiesToEven);
+        z16 = x16;
+        z16.multiply(y16, APFloat::rmNearestTiesToEven);
+        z16.convert(S, APFloat::rmNearestTiesToEven, &losesInfo);
+        EXPECT_TRUE(z.bitwiseIsEqual(z16))
+            << "sem=" << Sem << ", i=" << i << ", j=" << j;
+
+        // Skip divide by 0
+        if (j == 0 || j == 32)
+          continue;
+
+        // Divide
+        z = x;
+        z.divide(y, APFloat::rmNearestTiesToEven);
+        z16 = x16;
+        z16.divide(y16, APFloat::rmNearestTiesToEven);
+        z16.convert(S, APFloat::rmNearestTiesToEven, &losesInfo);
+        EXPECT_TRUE(z.bitwiseIsEqual(z16))
+            << "sem=" << Sem << ", i=" << i << ", j=" << j;
+
+        // Mod
+        z = x;
+        z.mod(y);
+        z16 = x16;
+        z16.mod(y16);
+        z16.convert(S, APFloat::rmNearestTiesToEven, &losesInfo);
+        EXPECT_TRUE(z.bitwiseIsEqual(z16))
+            << "sem=" << Sem << ", i=" << i << ", j=" << j;
+
+        // Remainder
+        z = x;
+        z.remainder(y);
+        z16 = x16;
+        z16.remainder(y16);
+        z16.convert(S, APFloat::rmNearestTiesToEven, &losesInfo);
+        EXPECT_TRUE(z.bitwiseIsEqual(z16))
+            << "sem=" << Sem << ", i=" << i << ", j=" << j;
+      }
+    }
+  }
+}
+
 TEST(APFloatTest, ConvertE4M3FNToE5M2) {
   bool losesInfo;
   APFloat test(APFloat::Float8E4M3FN(), "1.0");
@@ -6620,28 +6750,39 @@ TEST(APFloatTest, getExactLog2) {
     EXPECT_EQ(INT_MIN, APFloat(Semantics, "-3.0").getExactLog2());
     EXPECT_EQ(INT_MIN, APFloat(Semantics, "3.0").getExactLog2Abs());
     EXPECT_EQ(INT_MIN, APFloat(Semantics, "-3.0").getExactLog2Abs());
-    EXPECT_EQ(3, APFloat(Semantics, "8.0").getExactLog2());
-    EXPECT_EQ(INT_MIN, APFloat(Semantics, "-8.0").getExactLog2());
-    EXPECT_EQ(-2, APFloat(Semantics, "0.25").getExactLog2());
-    EXPECT_EQ(-2, APFloat(Semantics, "0.25").getExactLog2Abs());
-    EXPECT_EQ(INT_MIN, APFloat(Semantics, "-0.25").getExactLog2());
-    EXPECT_EQ(-2, APFloat(Semantics, "-0.25").getExactLog2Abs());
-    EXPECT_EQ(3, APFloat(Semantics, "8.0").getExactLog2Abs());
-    EXPECT_EQ(3, APFloat(Semantics, "-8.0").getExactLog2Abs());
+
+    if (I == APFloat::S_Float6E2M3FN) {
+      EXPECT_EQ(2, APFloat(Semantics, "4.0").getExactLog2());
+      EXPECT_EQ(INT_MIN, APFloat(Semantics, "-4.0").getExactLog2());
+      EXPECT_EQ(2, APFloat(Semantics, "4.0").getExactLog2Abs());
+      EXPECT_EQ(2, APFloat(Semantics, "-4.0").getExactLog2Abs());
+    } else {
+      EXPECT_EQ(3, APFloat(Semantics, "8.0").getExactLog2());
+      EXPECT_EQ(INT_MIN, APFloat(Semantics, "-8.0").getExactLog2...
[truncated]
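
For readers skimming the truncated diff: the two new fltSemantics initializers can be decoded with the existing APFloatBase query helpers. A small sketch (the checker function name is illustrative only, and the field-order reading assumes fltSemantics is laid out as {maxExponent, minExponent, precision, sizeInBits, nonFiniteBehavior} as in APFloat.cpp):

```cpp
#include "llvm/ADT/APFloat.h"
#include <cassert>

using namespace llvm;

static void checkFP6SemanticsParameters() {
  const fltSemantics &E3M2 = APFloat::Float6E3M2FN();
  assert(APFloat::semanticsMaxExponent(E3M2) == 4);  // max normal 28 = 2^4 * 1.75
  assert(APFloat::semanticsMinExponent(E3M2) == -2); // min normal 0.25 = 2^-2
  assert(APFloat::semanticsPrecision(E3M2) == 3);    // 2 stored mantissa bits + implicit bit
  assert(APFloat::semanticsSizeInBits(E3M2) == 6);   // S1E3M2

  const fltSemantics &E2M3 = APFloat::Float6E2M3FN();
  assert(APFloat::semanticsMaxExponent(E2M3) == 2);  // max normal 7.5 = 2^2 * 1.875
  assert(APFloat::semanticsMinExponent(E2M3) == 0);  // min normal 1.0 = 2^0
  assert(APFloat::semanticsPrecision(E2M3) == 4);    // 3 stored mantissa bits + implicit bit
  assert(APFloat::semanticsSizeInBits(E2M3) == 6);   // S1E2M3
}
```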


github-actions bot commented Jun 7, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

durga4github force-pushed the durgadossr/fp6_apfloat_updates branch from 21432ae to ac137c5 on June 7, 2024 09:01
@durga4github (Contributor, Author)

@ThomasRaoux, could you please help review this change?

tschuett added the floating-point (Floating-point math) label on Jun 7, 2024
arsenm requested a review from krzysz00 on June 7, 2024 12:37
durga4github force-pushed the durgadossr/fp6_apfloat_updates branch 2 times, most recently from 44b0572 to 3fd700c, on June 7, 2024 14:02
@ThomasRaoux (Contributor) left a comment

Looks good to me once the other comments are addressed. Thanks for the patch.

@krzysz00 (Contributor) left a comment

I have no issues with the code as written, but I'm rather confused by how it will be used.

What's the motivation for this PR? Will anyone be trying to constant-fold these things?

(If it's for MLIR support, I'd like to have a discussion there, since I don't think these scalars belong in FloatType and that the long-term solution is to allow arbitrary newtypes / tags on bits / ... up there)

@ThomasRaoux (Contributor)

> (If it's for MLIR support, I'd like to have a discussion there, since I don't think these scalars belong in FloatType and that the long-term solution is to allow arbitrary newtypes / tags on bits / ... up there)

I don't want to speak for the author but I'm interested in this support. I replied on your thread on discourse:
https://discourse.llvm.org/t/rfc-should-the-ocp-microscaling-float-scalars-be-added-to-apfloat-and-floattype/77530/8

durga4github force-pushed the durgadossr/fp6_apfloat_updates branch from 3fd700c to 94b25ae on June 7, 2024 19:09
@ThomasRaoux (Contributor) left a comment

LGTM, please wait to confirm that everybody is happy with the naming before merging.

@krzysz00, was your comment meant to be a blocker?

@krzysz00 (Contributor) commented Jun 7, 2024

@ThomasRaoux No, I just left a nitpick. I'm happy with the state of this.

This patch adds APFloat type support for two FP6
data types, E2M3 and E3M2. The definitions for the
two formats are detailed in section 5.3.2 of the
OCP specification, which can be accessed here:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
durga4github force-pushed the durgadossr/fp6_apfloat_updates branch from 94b25ae to 2ee1393 on June 8, 2024 10:12
@durga4github (Contributor, Author)

There is one test failure in CodeGen/LoongArch/opt-pipeline.ll, and it does not seem related to my changes here.
So, merging this change.

durga4github merged commit b1fe03f into llvm:main on Jun 11, 2024
4 of 7 checks passed
Lukacma pushed a commit to Lukacma/llvm-project that referenced this pull request Jun 12, 2024
This patch adds APFloat type support for two FP6 data types,
E2M3 and E3M2. The definitions for the two formats are detailed
in section 5.3.2 of the OCP specification, which can be accessed here:

https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
HerrCai0907 mentioned this pull request on Jun 13, 2024
durga4github deleted the durgadossr/fp6_apfloat_updates branch on June 14, 2024 08:48
sergey-kozub added a commit to sergey-kozub/llvm-project that referenced this pull request Aug 27, 2024
sergey-kozub added a commit to sergey-kozub/llvm-project that referenced this pull request Aug 28, 2024
sergey-kozub added a commit to sergey-kozub/llvm-project that referenced this pull request Aug 28, 2024
sergey-kozub added a commit to sergey-kozub/llvm-project that referenced this pull request Sep 5, 2024
sergey-kozub added a commit that referenced this pull request Sep 10, 2024
This PR adds `f6E3M2FN` type to mlir.

`f6E3M2FN` type is proposed in [OpenCompute MX
Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
It defines a 6-bit floating point number with bit layout S1E3M2. Unlike
IEEE-754 types, there are no infinity or NaN values.

```c
f6E3M2FN
- Exponent bias: 3
- Maximum stored exponent value: 7 (binary 111)
- Maximum unbiased exponent value: 7 - 3 = 4
- Minimum stored exponent value: 1 (binary 001)
- Minimum unbiased exponent value: 1 − 3 = −2
- Has Positive and Negative zero
- Doesn't have infinity
- Doesn't have NaNs

Additional details:
- Zeros (+/-): S.000.00
- Max normal number: S.111.11 = ±2^(4) x (1 + 0.75) = ±28
- Min normal number: S.001.00 = ±2^(-2) = ±0.25
- Max subnormal number: S.000.11 = ±2^(-2) x 0.75 = ±0.1875
- Min subnormal number: S.000.01 = ±2^(-2) x 0.25 = ±0.0625
```

Related PRs:
- [PR-94735](#94735) [APFloat]
Add APFloat support for FP6 data types
- [PR-97118](#97118) [MLIR] Add
f8E4M3 type - was used as a template for this PR
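
Under the APFloat support added in this PR, the extremes listed above can be checked directly. A minimal sketch (the helper name is illustrative; the expected values follow the table above and this PR's unit tests):

```cpp
#include "llvm/ADT/APFloat.h"
#include <cassert>

using namespace llvm;

static void checkFloat6E3M2FNExtremes() {
  const fltSemantics &S = APFloat::Float6E3M2FN();
  assert(APFloat::getLargest(S).convertToDouble() == 28.0);            // max normal
  assert(APFloat::getSmallestNormalized(S).convertToDouble() == 0.25); // min normal
  assert(APFloat::getSmallest(S).convertToDouble() == 0.0625);         // min subnormal
  APFloat NegZero = APFloat::getZero(S, /*Negative=*/true);
  assert(NegZero.isZero() && NegZero.isNegative());                    // signed zeros exist
}
```
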
sergey-kozub added a commit to sergey-kozub/llvm-project that referenced this pull request Sep 10, 2024
VitaNuo pushed a commit to VitaNuo/llvm-project that referenced this pull request Sep 12, 2024
sergey-kozub added a commit to sergey-kozub/llvm-project that referenced this pull request Sep 16, 2024
sergey-kozub added a commit that referenced this pull request Sep 16, 2024
This PR adds `f6E2M3FN` type to mlir.

`f6E2M3FN` type is proposed in [OpenCompute MX
Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
It defines a 6-bit floating point number with bit layout S1E2M3. Unlike
IEEE-754 types, there are no infinity or NaN values.

```c
f6E2M3FN
- Exponent bias: 1
- Maximum stored exponent value: 3 (binary 11)
- Maximum unbiased exponent value: 3 - 1 = 2
- Minimum stored exponent value: 1 (binary 01)
- Minimum unbiased exponent value: 1 − 1 = 0
- Has Positive and Negative zero
- Doesn't have infinity
- Doesn't have NaNs

Additional details:
- Zeros (+/-): S.00.000
- Max normal number: S.11.111 = ±2^(2) x (1 + 0.875) = ±7.5
- Min normal number: S.01.000 = ±2^(0) = ±1.0
- Max subnormal number: S.00.111 = ±2^(0) x 0.875 = ±0.875
- Min subnormal number: S.00.001 = ±2^(0) x 0.125 = ±0.125
```

Related PRs:
- [PR-94735](#94735) [APFloat]
Add APFloat support for FP6 data types
- [PR-105573](#105573) [MLIR]
Add f6E3M2FN type - was used as a template for this PR
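
One practical consequence of the finite-only ("FN") behavior is visible when converting into this type: with the handleOverflow change from this PR, an out-of-range value saturates to the largest finite number instead of becoming Inf or NaN. A hedged sketch (helper name illustrative; assumes round-to-nearest saturation as the overflow handling above describes):

```cpp
#include "llvm/ADT/APFloat.h"
#include <cassert>

using namespace llvm;

static void checkFloat6E2M3FNSaturation() {
  bool LosesInfo = false;
  APFloat V(APFloat::IEEEsingle(), "100.0"); // far above the E2M3 maximum of 7.5
  (void)V.convert(APFloat::Float6E2M3FN(), APFloat::rmNearestTiesToEven, &LosesInfo);
  assert(LosesInfo);                     // the conversion is not exact
  assert(V.convertToDouble() == 7.5);    // saturated to the largest finite value
  assert(!V.isInfinity() && !V.isNaN()); // this format has no Inf/NaN encodings
}
```
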
GleasonK pushed a commit to openxla/stablehlo that referenced this pull request Oct 23, 2024
…U) (#2581)

This is a proposal to add MX (microscaling) floating point types to
StableHLO.

Related links:
- StableHLO [PR#2582](#2582)
Add MX floating point types (f4E2M1FN, f6E2M3FN, f6E3M2FN, f8E8M0FNU)
- LLVM [PR#95392](llvm/llvm-project#95392)
[APFloat] Add APFloat support for FP4 data type
- LLVM [PR#94735](llvm/llvm-project#94735)
[APFloat] Add APFloat support for FP6 data types
- LLVM [PR#107127](llvm/llvm-project#107127)
[APFloat] Add APFloat support for E8M0 type
- LLVM [PR#108877](llvm/llvm-project#108877)
[MLIR] Add f4E2M1FN type
- LLVM [PR#107999](llvm/llvm-project#107999)
[MLIR] Add f6E2M3FN type
- LLVM [PR#105573](llvm/llvm-project#105573)
[MLIR] Add f6E3M2FN type
- LLVM [PR#111028](llvm/llvm-project#111028)
[MLIR] Add f8E8M0FNU type
- JAX-ML [PR#181](jax-ml/ml_dtypes#181) Add
sub-byte data types: float4_e2m1fn, float6_e2m3fn, float6_e3m2fn
- JAX-ML [PR#166](jax-ml/ml_dtypes#181) Add
float8_e8m0_fnu (E8M0) OCP MX scale format