Make Table.join use ordered joins on OrderedRVIterator #3159

patrick-schultz · 2018-03-15T19:34:31Z

This PR makes Table.join stop using Spark's join on Annotations and instead use ordered joins directly on region values.

Putting the new join into the IR will be a future PR.

I've given significant thought to how to make joins work making zero assumptions about partition keys. I'm convinced this is the right thing to do, and that with the right foundations it would just work without much extra code. I started to build those foundations in this PR, but some more work is needed to remove all partition key preconditions.
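
For readers unfamiliar with ordered joins, here is a minimal sketch of the idea, assuming both inputs are already sorted by key and the right side has distinct keys. It works on plain Scala iterators of pairs, not on Hail's `RegionValue` or `OrderedRVIterator`; the object name and the method signature are hypothetical and are not taken from this PR.

```scala
object OrderedJoinSketch {
  // Left join of two key-sorted iterators in a single pass, without hashing
  // or shuffling. Assumes `right` has distinct keys.
  def leftJoinDistinct[K, A, B](
    left: Iterator[(K, A)],
    right: Iterator[(K, B)]
  )(implicit ord: Ordering[K]): Iterator[(K, A, Option[B])] = {
    val r = right.buffered
    left.map { case (k, a) =>
      // Skip right rows whose key sorts before the current left key.
      while (r.hasNext && ord.lt(r.head._1, k)) r.next()
      // Peek without consuming, so repeated left keys match the same right row.
      val matched =
        if (r.hasNext && ord.equiv(r.head._1, k)) Some(r.head._2) else None
      (k, a, matched)
    }
  }
}
```

Because each iterator only moves forward, the join takes one pass over each side, which is the property that makes sorted inputs attractive compared with a general shuffle-based join.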

patrick-schultz · 2018-03-15T19:35:14Z

cc @cseed (say that three times fast!)

tpoterba

Some initial comments, mostly stylistic.

Something that makes me uncomfortable is letting the struct ExtendedOrdering compare values of different types (different lengths). It feels like we should be doing this by using physical types to produce the correct key for free, but we can't do this right now. What are your thoughts?

tpoterba · 2018-03-16T12:32:08Z

src/main/scala/is/hail/annotations/OrderedRVIterator.scala

@@ -5,6 +5,15 @@ import is.hail.utils._

 case class OrderedRVIterator(t: OrderedRVDType, iterator: Iterator[RegionValue]) {

+  def restrictToPKInterval(interval: Interval): Iterator[RegionValue] = {
+    val ur = new UnsafeRow(t.rowType, null, 0)


you can use the constructor that just takes a type here

tpoterba · 2018-03-16T12:34:50Z

src/main/scala/is/hail/rvd/OrderedRVD.scala

+        case "left" => _.leftJoin(_)
+        case "right" => _.rightJoin(_)
+        case "outer" => _.outerJoin(_)
+        case _ => fatal(


I think this can be removed, it's checked in Python.

can just make the match (@unchecked)
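
A small sketch of what the `@unchecked` suggestion could look like, using a hypothetical helper rather than the code in `OrderedRVD.scala`. Note that `@unchecked` only silences the compiler's exhaustivity warnings; a value that matches no case still throws `scala.MatchError` at runtime, so this relies on the Python side having validated the join type.

```scala
object JoinTypeDispatch {
  // Hypothetical helper: does this join type keep left rows with no match?
  // There is no fallback case; validation is assumed to happen upstream.
  def keepsUnmatchedLeft(joinType: String): Boolean =
    (joinType: @unchecked) match {
      case "inner" | "right" => false
      case "left" | "outer" => true
    }
}
```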

tpoterba · 2018-03-16T12:56:42Z

src/main/scala/is/hail/rvd/OrderedRVD.scala

+      joinType match {
+        case "inner" => _.innerJoinDistinct(_)
+        case "left" => _.leftJoinDistinct(_)
+        case _ => fatal(s"Unknown join type `$joinType'. Choose from `inner' or `left'.")


can also be unchecked, I think.

tpoterba · 2018-03-16T13:14:19Z

src/main/scala/is/hail/annotations/OrderedRVIterator.scala

+  def restrictToPKInterval(interval: Interval): Iterator[RegionValue] = {
+    val ur = new UnsafeRow(t.rowType, null, 0)
+    val pk = new KeyedRow(ur, t.kRowFieldIdx)
+    iterator.filter( rv => {


prefer curly braces around the entire thing instead of parens and curly braces together
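
To illustrate the style being requested, here is a sketch of an interval filter written with braces only. `IntRange` is a hypothetical stand-in for Hail's `Interval`, and the rows are plain `(Int, String)` pairs, not `RegionValue`s.

```scala
object IntervalFilterSketch {
  // Hypothetical stand-in for an interval: the half-open range [start, end).
  final case class IntRange(start: Int, end: Int) {
    def contains(x: Int): Boolean = start <= x && x < end
  }

  // Braces around the whole lambda, instead of `filter( rv => { ... })`.
  def restrictToRange(
    rows: Iterator[(Int, String)],
    range: IntRange
  ): Iterator[(Int, String)] =
    rows.filter { case (key, _) =>
      range.contains(key)
    }
}
```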

tpoterba · 2018-03-16T13:15:27Z

src/main/scala/is/hail/annotations/UnsafeRow.scala

@@ -301,3 +301,14 @@ class UnsafeRow(var t: TBaseStruct,
    }
  }
 }
+
+class KeyedRow(var row: Row, keyFields: Array[Int]) extends Row {
+  def this(row: Row) = this(row, Array.range(0, row.size))


this constructor doesn't seem used, can we delete it?

Oh, yes. That was from when I didn't have KeyedRow extending Row.
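
For context, a rough sketch of the idea behind a keyed row view: wrap a full row and expose only the key fields, in key order, without copying. This is a simplification that assumes rows are `IndexedSeq[Any]` instead of Spark `Row`s; `KeyedView` is a hypothetical name and is not the `KeyedRow` class from the diff.

```scala
object KeyedViewSketch {
  // A mutable view: `row` can be repointed at the next record, so a single
  // view object can be reused while iterating.
  final class KeyedView(var row: IndexedSeq[Any], keyFields: Array[Int]) {
    def size: Int = keyFields.length
    // Field i of the view is field keyFields(i) of the underlying row.
    def apply(i: Int): Any = row(keyFields(i))
    def toSeq: IndexedSeq[Any] = keyFields.toIndexedSeq.map(i => row(i))
  }
}
```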

tpoterba · 2018-03-16T13:40:17Z

src/main/scala/is/hail/sparkextras/RepartitionedOrderedRDD2.scala

+  * needed. No assumption should need to be made about partition keys, but currently
+  * assumes old partition key type is a prefix of the new partition key type.
+  */
+class RepartitionedOrderedRDD2(


RepartitionedOrderedRVD?

This is an RDD[RegionValue], not an RVD.

tpoterba · 2018-03-16T13:48:49Z

src/main/scala/is/hail/sparkextras/RepartitionedOrderedRDD2.scala

+
+case class RepartitionedOrderedRDD2Partition(
+    index: Int,
+    parents: Seq[Partition],


can this be an array? It's probably a List as implemented here.

I was copying from older code, and wasn't sure why some places used Array and some Seq. Array should be fine.

probably because Spark uses Seq everywhere :)

tpoterba · 2018-03-16T13:53:42Z

src/main/scala/is/hail/table/Table.scala

-      case "outer" => rddLeft.fullOuterJoin(rddRight).map { case (k, (l, r)) => merger(k, l.orNull, r.orNull) }
-      case _ => fatal("Invalid join type specified. Choose one of `left', `right', `inner', `outer'")
+    val left = this.rvd match {
+      case ordered: OrderedRVD => ordered


what about the case that it's ordered but keyed by the wrong thing? I'm having a bit of trouble reasoning about that case right now. Could it happen?

ah, keyBy does a horrible thing right now. so maybe it can't?

let's add an assertion, at least.

I think we need to decide what keys mean in different places. One thing I think would be reasonable is:

* Keys on Table, both in Python and in Scala, are purely metadata used to determine how to join.
* Keys on OrderedRVD are a way of recording the ordering invariant satisfied by the underlying RDD.

If this is the case, then I need to rewrite this. When I wrote it, I think I was assuming the Table keys and the OrderedRVD keys had to be the same. Now I understand Tables are never ordered currently. We should probably make a reasonable choice of what keys mean and document it.

tpoterba · 2018-03-16T13:55:37Z

src/main/scala/is/hail/table/Table.scala

+      joinType,
+      rvMerger,
+      new OrderedRVDType(left.typ.partitionKey, left.typ.key, newSignature))
+    copy2(rvd = joinedRVD, signature = newSignature, key = key)


don't need the key=key I think

tpoterba · 2018-03-16T13:56:02Z

src/test/scala/is/hail/rvd/OrderedRVDPartitionerSuite.scala

+  }
+
+  // @Test def testGetPartitionPKWithSmallerKeys() {
+  //   assert(partitioner.getPartitionPK(Row(2)) == 0)


meant to be uncommented?

No, this is a test that I think should be made to pass, but currently doesn't. This is where I need the general comparison of arbitrary length tuples. I could remove this, or add a comment saying to enable the test when we can.

patrick-schultz · 2018-03-16T14:07:32Z

We should discuss the struct ordering in person. I think there are orderings that can be defined on the space of all tuples (since the names don't matter) of arbitrary lengths, which are very helpful in working with changing keys and partition keys. In principle, it should be easy to repartition an OrderedRVD with a longer partition key to a partitioner with a shorter partition key, but currently that doesn't look simple to do. I tried to lay the groundwork here to make that trivial.
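
As one concrete candidate for such an ordering (an assumption for illustration; the thread does not say which ordering is intended), plain lexicographic comparison is total on tuples of any length: compare field by field over the common prefix, and if that ties, the shorter tuple sorts first. The sketch uses `Seq[Int]` keys and a hypothetical `compareKeys` name.

```scala
object PrefixOrderingSketch {
  // Lexicographic comparison of keys that may have different lengths.
  def compareKeys(a: Seq[Int], b: Seq[Int]): Int = {
    // First position within the common prefix where the keys differ, if any.
    val firstDiff = a.zip(b).collectFirst {
      case (x, y) if x != y => Integer.compare(x, y)
    }
    // If the common prefix ties, the shorter key sorts first.
    firstDiff.getOrElse(Integer.compare(a.length, b.length))
  }
}
```

Under this ordering a short key such as `(2)` sorts before every longer key that starts with `2`, so it can act as a lower bound when looking up a partition by a key prefix.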

patrick-schultz · 2018-03-16T16:44:38Z

Speaking of not understanding what keys mean, I found what looks to me like a bug, but I'm not sure. OrderedRVD.downcastToPK creates an OrderedRVD for which typ.kType is different from partitioner.kType. It's triggering the assert I made in RepartitionedOrderedRDD2 that says the new key must be a prefix of the old, to ensure that no sorting needs to be done.

I want to make join keys parameters of OrderedRVD.join, allowing them to be different from the partitioner keys. I was putting that off for a later PR, but now I think I might need to do that to fix this.
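
The prefix condition behind that assert can be sketched on lists of field names. This is a simplified illustration with hypothetical field names, not Hail's `isPrefixOf` on struct types. The point is that data sorted by `(contig, pos)` is also sorted by `(contig)`, so moving to a key that is a prefix needs no re-sorting.

```scala
object KeyPrefixSketch {
  // True when `shorter` lists the leading fields of `longer`, in order.
  def isPrefixOf(shorter: Seq[String], longer: Seq[String]): Boolean =
    longer.startsWith(shorter)
}
```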

patrick-schultz · 2018-03-20T16:18:59Z

I addressed most of your comments. I also fixed the downcastToPK problem by getting rid of it, instead adding a KeyedOrderedRVD which has a join key in addition to an ordering key.

addressed

patrick-schultz · 2018-03-20T18:50:19Z

I moved table join to the IR, and it was such a tiny change I just added it to this PR. I can separate it back out if you'd prefer.

tpoterba

This looks great. A few discussion points to address, but feel free to push back on any or all.

tpoterba · 2018-03-20T20:57:54Z

src/main/scala/is/hail/rvd/KeyedOrderedRVD.scala

+import org.apache.spark.rdd.RDD
+import is.hail.utils.fatal
+
+class KeyedOrderedRVD(val rvd: OrderedRVD, val key: Array[String]) extends Serializable {


does this need to be serializable?

tpoterba · 2018-03-20T20:58:44Z

src/main/scala/is/hail/rvd/OrderedRVD.scala

+
+    require(ordType.rowType == typ.rowType)
+    require(ordType.kType isPrefixOf typ.kType)
+//    require(newPartitioner.kType isIsomorphicTo ordType.kType)


It's frustrating because this should be an invariant satisfied by all OrderedRVDs. But I didn't see any simple way to make it pass, and I'm planning on a refactoring that removes kType from OrderedRVDPartitioner, getting rid of the redundancy. I'll just delete this.

tpoterba · 2018-03-20T21:07:13Z

src/main/scala/is/hail/rvd/UnpartitionedRVD.scala

+      }
+    }
+
+    OrderedRVD.shuffle(ordType, newPartitioner, filtered)


we might want to consider using the coerce strategy with newPartitioner as a hint partitioner. Maybe not, though.

coerce felt too high level for this, but I'm open to discussion. This method says "give me back an OrderedRVD with exactly this partitioner, dropping any data that fall outside the given partitions." So the partitioner argument is more than a hint.

The other factor is that coerce does a pass over the data collecting statistics, and detecting if it was already ordered. My feeling was that this method should be an explicit "just do a shuffle". Probably collecting statistics and making choices based on the results should be explicit in the IR, not built into the methods the IR calls. Maybe @cseed has thoughts?
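
To make the "exactly this partitioner, dropping data outside it" contract concrete, here is an in-memory sketch. It assumes integer keys and a hypothetical `RangePartitioner` given by sorted bounds; the real method works on Spark partitions via a shuffle, which this sketch does not model.

```scala
object ConstrainSketch {
  // Hypothetical partitioner: partition i covers keys in
  // [bounds(i), bounds(i + 1)). Bounds are assumed sorted ascending.
  final case class RangePartitioner(bounds: Vector[Int]) {
    def numPartitions: Int = bounds.length - 1
    def partitionOf(key: Int): Option[Int] =
      if (bounds.isEmpty || key < bounds.head || key >= bounds.last) None
      else Some(bounds.lastIndexWhere(_ <= key))
  }

  // Assign every key to its partition, silently dropping keys that fall
  // outside the partitioner's range, and sort within each partition.
  def constrain(keys: Seq[Int], p: RangePartitioner): Map[Int, Seq[Int]] =
    keys
      .flatMap(k => p.partitionOf(k).map(_ -> k))
      .groupBy(_._1)
      .map { case (i, kvs) => i -> kvs.map(_._2).sorted }
}
```

No statistics are collected and no check is made for existing order; the partitioner alone decides where each key goes, which is the distinction from `coerce` drawn above.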

Works with arbitrary partition keys.

Addressed

* wip * wip * RepartionedOrderedRDD2 works * Improve joins code on OrderedRVD * wip on Table.join * Add RVD.constrainToOrderedPartitioner * Add KeyedRow * Make compile and pass tests * Generalize RepartitionedOrderedRDD2 Works with arbitrary partition keys. * Table.join works using ordered joins * Cleanup * Start writing OrderedRVDPartitionerSuite * Cleanup * Address comments * Make KeyedOrderedRVD * fix * Move Table.join to IR * whoops * Address comments

patrick-schultz assigned tpoterba Mar 15, 2018

tpoterba previously requested changes Mar 16, 2018

patrick-schultz force-pushed the ordered_repartition branch from 8fe3b57 to c3df4a8 Compare March 20, 2018 16:15

tpoterba previously requested changes Mar 20, 2018

patrick-schultz mentioned this pull request Mar 21, 2018

Eliminate some but not all uses of RVD.rdd #3186

Merged

patrick-schultz added 19 commits March 21, 2018 12:06

Add KeyedRow

5e82a3e

wip

d469ae0

wip

7f83e69

RepartionedOrderedRDD2 works

c00ce7b

Improve joins code on OrderedRVD

636b4ca

wip on Table.join

0fe54e9

Add RVD.constrainToOrderedPartitioner

4d5bf07

Make compile and pass tests

b8687a9

Generalize RepartitionedOrderedRDD2

cc4f8dc

Works with arbitrary partition keys.

Table.join works using ordered joins

a62ad04

Cleanup

3ae6d80

Start writing OrderedRVDPartitionerSuite

35875ce

Cleanup

b548c7e

Address comments

65b2c47

Make KeyedOrderedRVD

1e7500b

fix

f99fa7a

Move Table.join to IR

fc93d0c

whoops

f1472ad

Address comments

72fd435

patrick-schultz force-pushed the ordered_repartition branch from ae2272b to 72fd435 Compare March 21, 2018 16:32

tpoterba approved these changes Mar 21, 2018

tpoterba merged commit c1cc9ed into hail-is:master Mar 21, 2018

patrick-schultz deleted the ordered_repartition branch June 4, 2018 14:07
