[compiler] Reorganize the compiler control flow to prepare for partition planning #12587

tpoterba · 2023-01-10T18:24:28Z

Evaluation of relational lets is an explicit pass.

Executing and rewriting shuffles is an explicit pass.

lowerDistributedSort executes the shuffle and produces a TableReader

Higher-level passes that recursively lower and execute are parameterized
by the contained pipeline.

patrick-schultz

Great change!

patrick-schultz · 2023-01-11T21:04:46Z

hail/src/main/scala/is/hail/expr/ir/Parser.scala

@@ -1563,7 +1563,6 @@ object IRParser {
        table_ir(env.onlyRelational)(it).map { child =>
          TableKeyBy(child, keys, isSorted)
        }
-      case "TableDistinct" => table_ir(env.onlyRelational)(it).map(TableDistinct)


Could you explain this deletion?

This is explained by an overeager finger on the delete key when I backed out the TableStrictify node :D

patrick-schultz · 2023-01-12T14:14:01Z

hail/src/main/scala/is/hail/expr/ir/lowering/LowerDistributedSort.scala

+    globals.typ.asInstanceOf[TStruct]
+  )
+
+  override def pathsUsed: Seq[String] = Seq()


Should this be the files in orderedOutputPartitions?

Short answer: maybe.

Longer answer: I don't think this really matters. pathsUsed is used to try to catch cases where users read from and write to the same table file, so anything generated in the compiler that's not user-exposed doesn't actually need to define it. I think the most correct answer for pathsUsed would be the root directory of all the output partitions (rather than each file individually). We should probably rename pathsUsed to userPathsUsed.

patrick-schultz · 2023-01-12T14:21:57Z

hail/src/main/scala/is/hail/expr/ir/lowering/LowerTableIR.scala

+  // change 1 - contexts is now not known until the partitioner is passed down
+  //


This looks like wip notes?

patrick-schultz · 2023-01-12T14:30:17Z

hail/src/main/scala/is/hail/rvd/RVDPartitioner.scala

+  def strictifyGeneric(allowedOverlap: Int): RVDPartitioner = {
+    if (satisfiesAllowedOverlap(allowedOverlap))
+      this
+    else
+      coarsen(allowedOverlap+1)
+        .strictify
+        .extendKey(kType)
+  }
  def strictify: RVDPartitioner = extendKey(kType)


You can simplify this into a single function

def strictify(allowedOverlap: Int = kType.size - 1): RVDPartitioner = {
  if (satisfiesAllowedOverlap(allowedOverlap))
    this
  else
    coarsen(allowedOverlap+1).extendKey(kType)
}

Also note, re. my colocalizedKey comment, that this does ensure that strictify(kType.size) is always a no-op.

patrick-schultz · 2023-01-12T14:42:15Z

hail/src/main/scala/is/hail/expr/ir/lowering/LowerTableIR.scala

@@ -1475,18 +1404,11 @@ object LowerTableIR {
          s"isSorted=${isSorted}, nPresFields=${nPreservedFields}, newKey=${newKey}, " +
            s"originalKey = ${loweredChild.kType.fieldNames.toSeq}, child key=${child.typ.keyType}")

-        if (nPreservedFields == newKey.length || isSorted)
+        require(nPreservedFields == newKey.length || isSorted)


Should probably just use definitelyDoesNotShuffle

patrick-schultz · 2023-01-12T14:46:07Z

hail/src/main/scala/is/hail/expr/ir/TableIR.scala

@@ -3213,6 +3223,8 @@ object TableOrderBy {
 }

 case class TableOrderBy(child: TableIR, sortFields: IndexedSeq[SortField]) extends TableIR {
+
+  lazy val definitelyDoesNotShuffle: Boolean = sortFields.forall(_.sortOrder == Ascending) && child.typ.key.startsWith(sortFields.map(_.field))


Could use TableOrderBy.isAlreadyOrdered to avoid duplicating logic.
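
For readers following along, the condition in the diff above can be sketched outside the compiler. The Python below is only a toy model of that check (every sort field is ascending, and the sort field names form a prefix of the child's key). The function name and the `(name, order)` tuple representation are assumptions made for illustration, not Hail's API.

```python
from typing import Sequence, Tuple

ASCENDING = "asc"
DESCENDING = "desc"


def definitely_does_not_shuffle(
    sort_fields: Sequence[Tuple[str, str]],  # hypothetical (field name, sort order) pairs
    child_key: Sequence[str],                # key field names of the child table
) -> bool:
    """Toy version of the check: no shuffle is needed when the requested
    ordering is already implied by the child's key ordering."""
    names = [name for name, _ in sort_fields]
    all_ascending = all(order == ASCENDING for _, order in sort_fields)
    # Equivalent of `key.startsWith(names)`: the slice is shorter than `names`
    # (and so unequal) when more sort fields are requested than key fields exist.
    key_has_prefix = list(child_key[: len(names)]) == names
    return all_ascending and key_has_prefix


if __name__ == "__main__":
    key = ["locus", "alleles"]
    print(definitely_does_not_shuffle([("locus", ASCENDING)], key))
    print(definitely_does_not_shuffle([("alleles", ASCENDING)], key))
    print(definitely_does_not_shuffle([("locus", DESCENDING)], key))
```

Keeping this logic in one place, as the comment suggests, means the node-level property and the lowering-time `require` cannot drift apart.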

patrick-schultz · 2023-01-12T14:46:49Z

hail/src/main/scala/is/hail/expr/ir/lowering/LowerTableIR.scala

-          ctx.backend.lowerDistributedSort(
-            ctx, loweredChild, sortFields, relationalLetsAbove, rowRType)
-        }
+        require(TableOrderBy.isAlreadyOrdered(sortFields, loweredChild.partitioner.kType.fieldNames))


Likewise should just use definitelyDoesNotShuffle

patrick-schultz · 2023-01-12T14:53:17Z

hail/src/main/scala/is/hail/expr/ir/lowering/LowerTableIR.scala

+      case TableMapPartitions(child, globalName, partitionStreamName, body, colocalizedKey) =>
+        val loweredChild = {
+          val lc = lower(child)
+          colocalizedKey match {
+            case Some(k) => lc.strictify(k)
+            case None => lc
+          }
+        }


If colocalizedKey is used as the allowedOverlap, then I think the name is misleading and should be something like allowedKeyOverlap. Also, it doesn't need to be optional: the None case is the same as allowedKeyOverlap = keyType.length.

tpoterba · 2023-01-27T15:53:18Z

Made almost all your suggested changes. I left the allowedOverlap on TMP as an option, because I think allowedOverlap == ktype.size actually does indicate that the consumer requires a keyed (sorted) input, and the None case doesn't.

patrick-schultz · 2023-01-27T16:43:31Z

> I think allowedOverlap == ktype.size actually does indicate that the consumer requires a keyed (sorted) input

It doesn't. I think we rarely if ever need to treat unkeyed tables as a special case.

Unkeyed tables are precisely the case where ktype.size == allowedOverlap == 0. E.g. see RVDPartitioner.unkeyed. So for TableMapPartitions on unkeyed tables, allowedOverlap = 0 means normal map partitions, and allowedOverlap = -1 means it needs to process all rows in one partition, which is consistent with the general case where allowedOverlap = ktype.size means normal map partitions and allowedOverlap = ktype.size - 1 means group all equal keys together. (We don't actually allow allowedOverlap = -1, but if we did it would always mean use a single partition.)
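
The semantics described in this comment can be made concrete with a small model. The sketch below is an assumption-laden illustration, not the interval-based check in RVDPartitioner: it works on materialized rows instead of range bounds, and the function name and row representation are invented for the example. It reads "allowedOverlap = k" as "rows that agree on the first k + 1 key fields must live in the same partition", which reproduces the cases listed above.

```python
from typing import Dict, Sequence, Tuple

Key = Tuple[int, ...]


def satisfies_allowed_overlap(
    partitions: Sequence[Sequence[Key]],
    key_size: int,
    allowed_overlap: int,
) -> bool:
    """Toy check over materialized keys (hypothetical helper, not Hail's API)."""
    # With allowed_overlap >= key_size there is nothing to enforce; this mirrors
    # the guard discussed for satisfiesAllowedOverlap.
    if allowed_overlap >= key_size:
        return True
    prefix_len = allowed_overlap + 1
    owner: Dict[Key, int] = {}
    for idx, part in enumerate(partitions):
        for key in part:
            prefix = key[:prefix_len]
            # Each prefix may be owned by at most one partition.
            if owner.setdefault(prefix, idx) != idx:
                return False
    return True


if __name__ == "__main__":
    split_key = [[(1, 1), (1, 2)], [(1, 2), (2, 0)]]     # key (1, 2) spans two partitions
    split_prefix = [[(1, 1), (1, 2)], [(1, 3), (2, 0)]]  # only the prefix (1,) spans two
    print(satisfies_allowed_overlap(split_key, 2, 2))
    print(satisfies_allowed_overlap(split_key, 2, 1))
    print(satisfies_allowed_overlap(split_prefix, 2, 1))
    print(satisfies_allowed_overlap(split_prefix, 2, 0))
```

In this model an unkeyed table has `key_size == 0`, so `allowed_overlap = 0` is always satisfied, while the hypothetical `allowed_overlap = -1` gives a prefix length of 0 and is only satisfied when all rows sit in a single partition.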

tpoterba · 2023-01-27T17:26:26Z

Here's a case I'm worried about:

I have TableAggregate(TableMapPartitions(child)). The key of child is locus.

The TableAggregate doesn't care about keyed input (commutative aggregator or something)

My TMP partition function requires its input to be sorted by locus, but allows overlap between partitions.

How do I use a single non-optional allowedOverlap to express this versus a TMP function that doesn't care about keying/sorting at all?

patrick-schultz · 2023-01-27T22:09:10Z

> How do I use a single non-optional allowedOverlap to express this versus a TMP function that doesn't care about keying/sorting at all?

Ah, I see. I think the answer is: you can't. You would need another integer parameter to say "I actually depend on this prefix of the key being sorted". It seems like allowedOverlap and requiredSortedPrefix are completely independent. In the single key case (for simplicity), you may or may not care if keys are localized in one partition, and you may or may not care if they're sorted. I don't see any connection.
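
The independence claim can be illustrated with a toy model. The two predicates below are hypothetical helpers over materialized rows (not Hail code), and "sorted" is assumed here to mean sorted within each partition. With a single key field, all four combinations of "sorted" and "colocated" are reachable, so neither requirement can be derived from the other.

```python
from typing import Dict, Sequence, Tuple

Key = Tuple[int, ...]


def sorted_by_prefix(partitions: Sequence[Sequence[Key]], prefix_len: int) -> bool:
    """True if rows within every partition are nondecreasing on the first prefix_len fields."""
    return all(
        all(a[:prefix_len] <= b[:prefix_len] for a, b in zip(part, part[1:]))
        for part in partitions
    )


def colocated_by_prefix(partitions: Sequence[Sequence[Key]], prefix_len: int) -> bool:
    """True if rows sharing the first prefix_len fields never span two partitions."""
    owner: Dict[Key, int] = {}
    for idx, part in enumerate(partitions):
        for key in part:
            if owner.setdefault(key[:prefix_len], idx) != idx:
                return False
    return True


if __name__ == "__main__":
    examples = {
        "sorted, key split across partitions": [[(1,), (2,)], [(2,), (3,)]],
        "unsorted, keys colocated": [[(2,), (1,), (2,)], [(3,)]],
        "sorted and colocated": [[(1,), (2,)], [(3,)]],
        "neither": [[(2,), (1,)], [(2,)]],
    }
    for label, parts in examples.items():
        print(label, sorted_by_prefix(parts, 1), colocated_by_prefix(parts, 1))
```

The first example corresponds to the TableAggregate case raised above: the body needs sorted input but tolerates a key straddling a partition boundary.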

patrick-schultz · 2023-03-02T16:31:42Z

hail/python/hail/table.py

@@ -3834,7 +3834,7 @@ def _map_partitions(self, f):
            raise ValueError('Table._map_partitions must preserve key fields')

        body_ir = ir.Let('global', ir.Ref(globals_uid, self._global_type), body._ir)
-        return Table(ir.TableMapPartitions(self._tir, globals_uid, rows_uid, body_ir))
+        return Table(ir.TableMapPartitions(self._tir, globals_uid, rows_uid, body_ir, 0, len(self.key)))


If I remember the semantics of requested_key right, doesn't this need to assume the body might depend on the entire key?

Yes, that's right. I think that's the safe assumption. We can add these as params to the method in the future.

patrick-schultz · 2023-03-02T16:33:25Z

hail/src/main/scala/is/hail/HailContext.scala

-import is.hail.expr.ir.BaseIR
+import is.hail.expr.ir.{BaseIR, LoweringAnalyses, TableIR}
 import is.hail.expr.ir.functions.IRFunctionRegistry
+import is.hail.expr.ir.lowering.{CanLowerEfficiently, DArrayLowering, LowerTableIR, TableStage}


I assume these are unused, since there are no other changes?

patrick-schultz · 2023-03-02T17:55:53Z

hail/src/main/scala/is/hail/expr/ir/lowering/LowerTableIR.scala

-                  }
-                }
+          bindIR(invoke("extend", TArray(TInt32), ToArray(mapIR(rangeIR(nPartitionsAdj)) { partIdx =>
+            invoke("ceil", TFloat64, partIdx.toD * numRowsRef.toD / nPartitionsAdj.toDouble).toI


Was there a reason this can't be (partIdx * numRowsRef) floorDiv nPartitionsAdj?

I think that works; I didn't take the final step in simplifying after rewriting. 👍
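
As a quick sanity check on the two formulas, here is a small Python sketch. It assumes the surrounding code (truncated in the diff) appends `numRows` as the final boundary; the function names are invented for the example. The ceil-based and floor-based variants do not produce identical boundaries, but both yield partitions that cover every row and differ in size by at most one.

```python
import math
from typing import List


def boundaries_ceil(num_rows: int, n_parts: int) -> List[int]:
    """Start offsets computed as in the diff: ceil(partIdx * numRows / nPartitions)."""
    starts = [math.ceil(i * num_rows / n_parts) for i in range(n_parts)]
    return starts + [num_rows]  # assumed: the final boundary is appended separately


def boundaries_floor(num_rows: int, n_parts: int) -> List[int]:
    """Start offsets using the suggested integer form: (partIdx * numRows) floorDiv nPartitions."""
    starts = [(i * num_rows) // n_parts for i in range(n_parts)]
    return starts + [num_rows]


def partition_sizes(boundaries: List[int]) -> List[int]:
    """Row count of each partition, from consecutive boundaries."""
    return [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    print(boundaries_ceil(10, 3), partition_sizes(boundaries_ceil(10, 3)))
    # [0, 4, 7, 10] [4, 3, 3]
    print(boundaries_floor(10, 3), partition_sizes(boundaries_floor(10, 3)))
    # [0, 3, 6, 10] [3, 3, 4]
```

The integer form also avoids the round trip through floating point, so it stays exact for row counts too large to be represented precisely as a double.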

patrick-schultz · 2023-03-02T19:06:34Z

hail/src/main/scala/is/hail/rvd/RVDPartitioner.scala

@@ -69,7 +69,7 @@ class RVDPartitioner(
      Some(Interval(rangeBounds.head.left, rangeBounds.last.right))

  def satisfiesAllowedOverlap(testAllowedOverlap: Int): Boolean =
-    RVDPartitioner.isValid(sm, kType, rangeBounds, testAllowedOverlap)
+    (testAllowedOverlap >= kType.size) || RVDPartitioner.isValid(sm, kType, rangeBounds, testAllowedOverlap)


Could we move the guard into RVDPartitioner.isValid?

yeah, good change.

addressed, thanks

…ion planning

Evaluation of relational lets is an explicit pass.

Executing and rewriting shuffles is an explicit pass.

* lowerDistributedSort executes the shuffle and produces a TableReader

Higher-level passes that recursively lower and execute are parameterized by the contained pipeline.

fix EvalRelationalLets
fix some tests
update TMP and some CanLowerEfficiently fixes
fix
fixes
colocalized_key => allowed_overlap
address comments
TMP change again

ehigham · 2023-03-16T15:19:51Z

hail/src/main/scala/is/hail/types/TypeWithRequiredness.scala

 object RTuple {
-  def apply(fields: Seq[TypeWithRequiredness]): RTuple =
-    RTuple(Array.tabulate(fields.length)(i => RField(i.toString, fields(i), i)))
+  def apply(fields: Seq[(Int, TypeWithRequiredness)]): RTuple =
+    RTuple(fields.zipWithIndex.map { case ((fdIdx, typ), i) => RField(fdIdx.toString, typ, i) }.toIndexedSeq)
 }


tpoterba added the stacked PR label Jan 10, 2023

tpoterba assigned patrick-schultz Jan 10, 2023

patrick-schultz previously requested changes Jan 12, 2023

tpoterba removed the stacked PR label Jan 12, 2023

tpoterba force-pushed the compiler-reorg-1 branch from d2a6e4f to 3bfb13f Compare January 27, 2023 15:28

tpoterba force-pushed the compiler-reorg-1 branch 2 times, most recently from d13e753 to 07b2564 Compare February 7, 2023 18:54

tpoterba force-pushed the compiler-reorg-1 branch 4 times, most recently from d394535 to 61922ba Compare February 17, 2023 22:06

tpoterba force-pushed the compiler-reorg-1 branch 2 times, most recently from c90e9de to 6f06ac2 Compare February 27, 2023 15:43

patrick-schultz previously requested changes Mar 2, 2023

patrick-schultz approved these changes Mar 4, 2023

tpoterba force-pushed the compiler-reorg-1 branch 2 times, most recently from d9f1d99 to a880a90 Compare March 14, 2023 16:21

tpoterba added 7 commits March 15, 2023 09:56

fixes to TMP

2a2a752

fix

0f16fff

fix

152a37b

yay

e9abb2b

fix

d4425c6

fix

f0598f6

tpoterba added 27 commits March 15, 2023 09:57

fix tuples

a51ed31

fixes to lowering vs execute

f3f37d3

fixes

b1617e1

fix TMP in agg

047a87a

rebase

0e603d9

move repartition check up

7bf62a7

fix lowering flow

a7052c9

fix rebase

359d848

fixes

44b92e7

darray lowering

3a09374

more fixes

341dac3

fix tail

56c4fa7

fix subset

24318cd

fixes

80b8a06

delete stagesuite

2f08005

fix

0b9bb5c

fixes

03ee403

ugh

764c70f

better ci error

52b432e

fix prune and sjavaarray

6a76b16

fix union

81579e9

fix up rebase

e9be6df

remove comment and debug

eb8bf8d

back out isvalid change

1872b3f

fix tablegen tests

1a38d37

fixes

04eeac3

fix up rebase

7118893

tpoterba force-pushed the compiler-reorg-1 branch from a880a90 to 7118893 Compare March 15, 2023 14:19

danking merged commit 87d3acd into hail-is:main Mar 15, 2023

ehigham reviewed Mar 16, 2023
