Created RVD #8

cseed · 2017-12-05T03:56:52Z

And related classes:

OrderedRDD2 => OrderedRVD
OrderedRDD2Type => OrderedRVType
OrderedrDD2Partitioner => OrderedRVPartitioner
PartitionKeyInfo2 => OrderedRVPartitionInfo

I moved those classes into is.hail.rvd. RVD and OrderedRVD don't extend RDD anymore. All map/mapPartitions functions take a suitable type.

Surprisingly few changes outside OrderedRDD/RVD. 98% of this is moving things around. (Sorry for lack of good diffs, I realize in retrospect I should have made the changes in place and kept the moves for a separate PR.)

I slightly strengthened the invariant of persist: persisting something already persisted at a new storage level throws an exception. I think everything else behaves the same.

@tpoterba FYI

wip. Compiling, untested.

AggregatorSuite still failing.

cseed · 2017-12-05T03:58:11Z

And FYI @danking if you have any feedback.

tpoterba · 2017-12-05T10:35:02Z

I'm not sure I totally like persist erroring. In particular, there are places where we should definitely persist key tables computed from VSMs automatically without the user needing to do so, and so this could cause a user to encounter a persist error without calling it twice themselves.

persist, unpersist never error out.

cseed · 2017-12-05T18:44:56Z

@tpoterba I addressed your feedback. persist and unpersist never error out.

jigold · 2017-12-05T16:36:24Z

src/main/scala/is/hail/rvd/RVD.scala

+  def apply(rowType: Type, rdd: RDD[RegionValue]): ConcreteRVD = new ConcreteRVD(rowType, rdd)
+}
+
+case class PersistedRVRDD(


Should this be PersistedRVD to be consistent? Same with the parameters persistedRDD and iterationRDD.

I agree the names aren't great, but I don't think it is right to call them RVDs, either. Neither this nor its parameters are RVDs: they are the RDD[RegionValue] constituents necessary to implement persisted RVDs. Also, I don't want to make these RVDs because this is uses for both persist on RVD and OrderedRVD.

jigold · 2017-12-05T16:40:12Z

src/main/scala/is/hail/rvd/RVD.scala

+    AggregateWithContext.aggregateWithContext(rdd)(context)(zeroValue)(seqOp, combOp)
+  }
+
+  def mapWithContext[C](newRowType: Type)(makeContext: () => C)(f: (C, RegionValue) => RegionValue) =


Can you keep all of the map functions together. Likewise with mapPartitions

jigold · 2017-12-05T16:41:22Z

src/main/scala/is/hail/rvd/RVD.scala

+
+  def count(): Long = rdd.count()
+
+  protected def persistRVRDD(level: StorageLevel): PersistedRVRDD = {


Same name question.

jigold · 2017-12-05T18:22:05Z

src/main/scala/is/hail/rvd/OrderedRVD.scala

+  }
+
+  def apply(typ: OrderedRVType, partitioner: OrderedRVPartitioner, rvd: RVD): OrderedRVD =
+    apply(typ, partitioner, rvd.rdd)


assert rvd.typ == typ.rowType

Good catch! Thanks.

jigold · 2017-12-05T18:44:54Z

src/main/scala/is/hail/rvd/OrderedRVD.scala

+    if (!shuffle && maxPartitions >= n)
+      return this
+    if (shuffle) {
+      val shuffled = rdd.coalesce(maxPartitions, shuffle = true)


Does this require rewriting like what was needed for persist? Could you please double check I rewrote coalesce properly?

Yes, I read over coalesce, it seemed correct to me.

cseed · 2017-12-05T19:00:24Z

Back to you. Addressed comments.

* # This is a combination of 22 commits. # This is the 1st commit message: apply resettable context forgot to fix one use of AutoCloseable fix add setup iterator more sensible method ordering make TrivialContext Resettable a few more missing resettablecontexts address comments apply resettable context forgot to fix one use of AutoCloseable fix add setup iterator more sensible method ordering remove rogue element type type make TrivialContext Resettable wip wip wip wip use safe row in join suite pull over hailcontext remove Region.clear(newEnd) add selectRegionValue # This is the commit message #2: convert relational.scala ; # This is the commit message #3: scope the extract aggregators constfb call # This is the commit message #4: scope interpret # This is the commit message #5: typeAfterSelect used by selectRegionValue # This is the commit message #6: load matrix # This is the commit message #7: imports # This is the commit message #8: loadbgen converted # This is the commit message #9: convert loadplink # This is the commit message hail-is#10: convert loadgdb # This is the commit message hail-is#11: convert loadvcf # This is the commit message hail-is#12: convert blockmatrix # This is the commit message hail-is#13: convert filterintervals # This is the commit message hail-is#14: convert ibd # This is the commit message hail-is#15: convert a few methods # This is the commit message hail-is#16: convert split multi # This is the commit message hail-is#17: convert VEP # This is the commit message hail-is#18: formatting fix # This is the commit message hail-is#19: add partitionBy and values # This is the commit message hail-is#20: fix bug in localkeysort # This is the commit message hail-is#21: fixup HailContext.readRowsPartition use # This is the commit message hail-is#22: port balding nichols model * apply resettable context forgot to fix one use of AutoCloseable fix add setup iterator more sensible method ordering make TrivialContext Resettable a few more missing resettablecontexts address comments apply resettable context forgot to fix one use of AutoCloseable fix add setup iterator more sensible method ordering remove rogue element type type make TrivialContext Resettable wip wip wip wip use safe row in join suite pull over hailcontext remove Region.clear(newEnd) add selectRegionValue convert relational.scala ; scope the extract aggregators constfb call scope interpret typeAfterSelect used by selectRegionValue load matrix imports loadbgen converted convert loadplink convert loadgdb convert loadvcf convert blockmatrix convert filterintervals convert ibd convert a few methods convert split multi convert VEP formatting fix add partitionBy and values fix bug in localkeysort fixup HailContext.readRowsPartition use port balding nichols model port over table.scala couple fixes convert matrix table remove necessary use of rdd variety of fixups wip add a clear * Remove direct Region allocation from FilterColsIR When regions are off-heap, we can allow the globals to live in a separate, longer-lived Region that is not cleared until the whole partition is finished. For now, we pay the memory cost. * Use RVDContext in MatrixRead zip This Region will get cleared by consumers. I introduced the zip primitive which is a safer way to zip two RVDs because it does not rely on the user correctly clearing the regions used by the left and right hand sides of the zip. * Control the Regions in LoadGDB I do not fully understand how LoadGDB is working, but a simple solution to the use-case is to serialize to arrays of bytes and parallelize those. I realize there is a proliferation of `coerce` methods. I plan to trim this down once we do not have RDD and ContextRDD coexisting * wip * unify RVD.run * reset in write * fixes * use context region when allocating * also read RVDs using RVDContext * formatting * address comments * remove unused val * abstract over boundary * little fixes * whoops forgot to clear before persisting This fixes the LDPrune if you dont clear the region things go wrong. Not sure what causes that bug. Maybe its something about encoders? * serialize for shuffles, region.scoped in matrixmapglobals, fix joins * clear more! * wip * wip * rework GeneralRDD to ease ContextRDD transition * formatting * final fixes * formatting * merge failures * more bad merge stuff * formatting * remove unnecessary stuff * remove fixme * boom! * variety of merge mistakes * fix destabilize bug * add missing newline * remember to clear the producer region in localkeysort * switch def to val * cleanup filteralleles and exporbidbimfam * fix clearing and serialization issue * fix BitPackedVectorView Previously it always assumed the variant struct started at offset zero, which is not true * address comments, remove a comment * remove direct use of Region * oops * werrrks, mebbe * needs cleanup * fix filter intervals * fixes * fixes * fix filterintervals * remove unnecessary copy in TableJoin * and finally fix the last test * re-use existing CodecSpec definition * remove unnecessary boundaries * use RVD abstraction when possible * formatting * bugfix: RegionValue must know its region * remove unnecessary val and comment * remove unused methods * eliminate unused constructors * undo debug change * formatting * remove unused imports * fix bug in tablejoin * fix RichRDDSuite test If you have no data, then you have no partitions, not 1 partition

cseed added 4 commits December 4, 2017 18:33

Created RVD.

791176b

wip. Compiling, untested.

Fixed serialization bug.

5148c4a

AggregatorSuite still failing.

Fixed another serialization bug.

994e55a

Fixed tests.

eab8120

Addressed feedback.

f0ae68e

persist, unpersist never error out.

jigold requested changes Dec 5, 2017

View reviewed changes

Addressed comments.

7d9cd84

jigold approved these changes Dec 5, 2017

View reviewed changes

jigold merged commit c54e3b7 into jigold:rdd2_ldprune Dec 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Created RVD #8

Created RVD #8

cseed commented Dec 5, 2017

cseed commented Dec 5, 2017

tpoterba commented Dec 5, 2017

cseed commented Dec 5, 2017

jigold Dec 5, 2017

cseed Dec 5, 2017

jigold Dec 5, 2017

jigold Dec 5, 2017

jigold Dec 5, 2017

cseed Dec 5, 2017

jigold Dec 5, 2017

cseed Dec 5, 2017

cseed commented Dec 5, 2017


		def count(): Long = rdd.count()

		protected def persistRVRDD(level: StorageLevel): PersistedRVRDD = {

Created RVD #8

Created RVD #8

Conversation

cseed commented Dec 5, 2017

cseed commented Dec 5, 2017

tpoterba commented Dec 5, 2017

cseed commented Dec 5, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cseed commented Dec 5, 2017