Efficient Joins and (re)Partitioning #1324

devin-petersohn · 2016-12-23T16:59:20Z

Ready for review.

AmplabJenkins · 2016-12-23T17:03:01Z

Can one of the admins verify this patch?

fnothaft · 2016-12-23T17:04:02Z

Jenkins, add to whitelist.

AmplabJenkins · 2016-12-23T17:11:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1700/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 1bbc8ab # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10
Checking out Revision 1bbc8ab (origin/pr/1324/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 1bbc8abae508530dab55ae88be11935a74b594e0
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

devin-petersohn · 2016-12-23T17:43:17Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala

+      val i = z.leftOuterShuffleRegionJoin(x).rdd.collect
+      assert(h.length == i.length)
+
+      val t = sc.loadParquetAlignments("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/sortedAlignments.parquet.txt")


I will fix this path.

devin-petersohn · 2016-12-23T17:43:26Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala

+  sparkTest("testing partitioner") {
+    time {
+      //val x = sc.loadBam("/data/recompute/alignments/NA12878.bam.aln.bam")
+      val x = sc.loadBam("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/bqsr1.sam")


I will fix this path.

AmplabJenkins · 2016-12-23T17:56:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1701/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 58cfe15 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10
Checking out Revision 58cfe15 (origin/pr/1324/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 58cfe152a16c84931ba087d38ffa53a9397b5bbf
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

fnothaft

Generally LGTM! Just made a first review pass. For more detailed review, I need more inline docs.

In addition to the changes I suggested, please remove the .parquet test resource files from this PR. If needed for tests, they should be generated inside the test. We try to avoid checking in binary files whenever possible.

Linking to #1216.
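
The request above to generate test resources inside the test, rather than checking in binary files or pointing at a developer's home directory, could look roughly like the following. This is a minimal sketch using only the JDK's java.nio.file API; the object name TempResource and the helper withTempFile are hypothetical and not part of ADAM or this PR.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{ Files, Path }

object TempResource {
  // Hypothetical helper: writes `contents` to a fresh temporary file,
  // passes its path to `body`, and always removes the file afterwards.
  def withTempFile[A](contents: String)(body: Path => A): A = {
    val dir = Files.createTempDirectory("adam-test")
    val file = dir.resolve("generated.txt")
    try {
      Files.write(file, contents.getBytes(StandardCharsets.UTF_8))
      body(file)
    } finally {
      // Clean up so nothing generated by the test is left on disk.
      Files.deleteIfExists(file)
      Files.deleteIfExists(dir)
    }
  }
}
```

A test would then build its input inside the callback and hand the generated path to whatever loader it exercises, so no absolute local path appears in the suite.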

fnothaft · 2016-12-23T18:15:41Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

@@ -62,8 +62,11 @@ import org.bdgenomics.utils.io.LocalFileByteAccess
 import org.bdgenomics.utils.misc.{ HadoopUtil, Logging }
 import org.seqdoop.hadoop_bam._
 import org.seqdoop.hadoop_bam.util._
+


Nit: remove space.

fnothaft · 2016-12-23T18:16:03Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
   * @param filePath The (possibly globbed) filepath to load a VCF from.
   * @return Returns a tuple of metadata from the VCF header, including the
   *   sequence dictionary and a list of the samples contained in the VCF.
-   *


Nit: I'd prefer to keep these spaces.

Yeah, there are a lot of these whitespace changes and there shouldn't be any of them. ;)

Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)

I will refix these. They were fixed at some point but my IDE is not cooperating :(

fnothaft · 2016-12-23T18:16:16Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+   * @param filename the filename for the metadata
+   * @return a partition map if the data was written sorted, or an empty Seq if unsorted
+   */
+  def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = {


This should be (package) private.

fnothaft · 2016-12-23T18:18:49Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+      // this unfortunately seems to be the only way to do this
+      // avro does not seem to support getting metadata fields out once you have the from the string
+      val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]]
+      //we want this for the use case


Can you expand this comment a bit? Actually, what might be preferable, is to have a longer inline comment that documents the format of what we are parsing. I think that would make this section of the code much easier to review.

fnothaft · 2016-12-23T18:18:58Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+      // parsing the json from the metadata header
+      // this unfortunately seems to be the only way to do this
+      // avro does not seem to support getting metadata fields out once you have the from the string
+      val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]]


Why do we need the cast here?
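
One way the cast could be avoided, sketched under assumptions: a JSON parser that returns an untyped Option[Any] can be narrowed with a pattern match instead of asInstanceOf. The object and method names below are hypothetical, and the parsed value is a hand-built stand-in rather than real output from JSON.parseFull.

```scala
object MetadataParsing {
  // Hypothetical helper: narrows an untyped parse result to Map[String, String].
  // `parsed` stands in for the Option[Any] that an untyped JSON parser returns.
  def asStringMap(parsed: Option[Any]): Map[String, String] = parsed match {
    case Some(m: Map[_, _]) =>
      // Keep only String -> String entries; other entries are dropped, not cast.
      m.toSeq.collect { case (k: String, v: String) => k -> v }.toMap
    case _ =>
      // Nothing parsed, or the top-level value was not a map.
      Map.empty
  }

  def main(args: Array[String]): Unit = {
    val parsed: Option[Any] = Some(Map("sorted" -> "true", "version" -> 1))
    println(asStringMap(parsed))
  }
}
```

Unlike a blanket cast, this fails soft: a malformed header yields an empty map instead of a ClassCastException at the first lookup.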

fnothaft · 2016-12-23T19:12:04Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala

@@ -308,7 +308,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
    filePath: String,
    asType: Option[SAMFormat] = None,
    asSingleFile: Boolean = false,
-    isSorted: Boolean = false,
+    isSorted: Boolean = SortedTrait.isSorted,


Ditto RE: eliminating parameter.

fnothaft · 2016-12-23T19:12:08Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala

@@ -583,7 +583,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
   */
  def realignIndels(
    consensusModel: ConsensusGenerator = new ConsensusGeneratorFromReads,
-    isSorted: Boolean = false,
+    isSorted: Boolean = SortedTrait.isSorted,


Ditto RE: eliminating parameter.

fnothaft · 2016-12-23T19:12:30Z

adam-core/src/test/scala/org/bdgenomics/adam/models/RecordGroupDictionarySuite.scala

@@ -20,7 +20,7 @@ package org.bdgenomics.adam.models
 import htsjdk.samtools.SAMReadGroupRecord
 import org.scalatest.FunSuite

-class RecordGroupDictionarySuite extends FunSuite {
+class moRecordGroupDictionarySuite extends FunSuite {


Revert the stray "mo" prefix on the class name.

fnothaft · 2016-12-23T19:12:40Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala

+import org.bdgenomics.adam.rdd.ADAMContext._
+import org.bdgenomics.utils.misc.SparkFunSuite
+
+/**


No author comments.

fnothaft · 2016-12-23T19:12:44Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala

@@ -0,0 +1,72 @@
+package org.bdgenomics.adam.rdd


Add license header.

Running ./scripts/format-source should add the license header.

AmplabJenkins · 2016-12-28T03:46:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1709/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 86150cf # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10
Checking out Revision 86150cf (origin/pr/1324/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 86150cfa21599654d53606727c562e3ae5c93ec3
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

devin-petersohn · 2016-12-28T06:54:49Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/variant/VariantContextRDD.scala

@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
   * Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts
   * and saves to disk as VCF.
   *
-   * @param filePath The filepath to save to.
+   * @param args The arguments for saving the data


Updated docs in a few places where they were not correct.

use full sentences for method parameter docs

fnothaft

Made an additional review pass. Let's chat about the boundary case where we have a record that is duplicated across partition boundaries.

fnothaft · 2016-12-28T07:29:38Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

@@ -296,8 +298,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
   * @param filePath The filepath to load a single Avro file of sequence
   *   dictionary info from.
   * @return Returns the SequenceDictionary representing said reference build.
-   *
-   * @see loadAvroSequences
+    * @see loadAvroSequences


Spaced one space too far.

fnothaft · 2016-12-28T07:29:43Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

@@ -330,8 +331,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
   * @param filePath The filepath to load a single Avro file containing read
   *   group metadata.
   * @return Returns a RecordGroupDictionary.
-   *
-   * @see loadAvroReadGroupMetadata
+    * @see loadAvroReadGroupMetadata


Spaced one space too far.

fnothaft · 2016-12-28T07:29:49Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

-   *
-   * @throws FileNotFoundException if the path does not match any files.
+    * @see getFsAndFiles
+    * @throws FileNotFoundException if the path does not match any files.


Spaced one space too far.

Also, lost whitespace.

fnothaft · 2016-12-28T07:30:01Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

-   * @see getFiles
-   *
-   * @throws FileNotFoundException if the path does not match any files.
+    * @see getFiles


Spaced one space too far.
Also, lost whitespace.

fnothaft · 2016-12-28T07:30:13Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

-   *
-   * @throws FileNotFoundException if the path does not match any files.
+    * @see getFiles
+    * @throws FileNotFoundException if the path does not match any files.


Spaced one space too far.
Also, lost whitespace.

fnothaft · 2016-12-28T08:06:19Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+    replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd))
+  }
+
+  private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner {


Move to GenomicPartitioners.scala

Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner?

GenomicPosition is used in GenomicPartitioners.scala.

fnothaft · 2016-12-28T08:07:02Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+    if (isSorted) {
+      sampleSchema.addProp("sorted", "true".asInstanceOf[Any])
+      sampleSchema.addProp("partitionMap", partitionMap.mkString(",").asInstanceOf[Any])
+    }


Add whitespace after.

fnothaft · 2016-12-28T08:08:48Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala

@@ -251,7 +251,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
   *
   * @return Returns a SAM/BAM formatted RDD of reads, as well as the file header.
   */
-  def convertToSam(isSorted: Boolean = false): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time {
+  def convertToSam(isSorted: Boolean = isSorted): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time {


There's going to be more logic that is needed here WRT dealing with records that are duplicated by the flanking process.

fnothaft · 2016-12-28T08:09:50Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala

@@ -0,0 +1,72 @@
+package org.bdgenomics.adam.rdd


Running ./scripts/format-source should add the license header.

fnothaft · 2016-12-28T08:10:12Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala

+ */
+class SortedGenomicRDDSuite extends SparkFunSuite {
+
+  def time[R](block: => R): R = {


Dropping a note here to remove this block before we merge.

heuermh · 2016-12-28T14:57:54Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
   * @param filePath The (possibly globbed) filepath to load a VCF from.
   * @return Returns a tuple of metadata from the VCF header, including the
   *   sequence dictionary and a list of the samples contained in the VCF.
-   *


Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)

heuermh · 2016-12-28T14:59:56Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+   * @param filename the filename for the metadata
+   * @return a partition map if the data was written sorted, or an empty Seq if unsorted
+   */
+  private[rdd] def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = {


determineIsSortedAndExtractPartitionMap → extractPartitionMap

heuermh · 2016-12-28T15:00:33Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+    val maybePartitionMap = metaDataMap.get("partitionMap")
+    // we didn't write a partition map, which means this was not sorted at write
+    // or at least we didn't have information that it was sorted
+    if(maybePartitionMap.isEmpty) {


whitespace if( → if (

heuermh · 2016-12-28T15:02:30Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+      // Seq with commas separating each tuple of (ReferenceRegion, ReferenceRegion)
+      // The first split is breaking up the tuples. Each tuple starts with a
+      // "(" then a ReferenceRegion, so we are simply pulling out the tuples
+      // by using the start of each tuple as the indicator


How about adding ReferenceRegion to the bdg-formats schemas and using Avro to write out the partition map? I don't see why we should be writing some stuff out as Parquet, some as Avro, some as JSON, and some as Strings or byte-serialized JVM objects.
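
To make the trade-off concrete, here is a minimal sketch of the kind of string round-trip the inline comment describes, with an explicit delimiter format instead of relying on toString output. Everything here is an assumption for illustration: Region is a simplified stand-in for ReferenceRegion, and the "name:start-end" layout with "|" and ";" separators is not the format this PR writes.

```scala
object PartitionMapCodec {
  // Hypothetical stand-in for ReferenceRegion: contig name, start, end only.
  final case class Region(name: String, start: Long, end: Long)

  private def encodeRegion(r: Region): String = s"${r.name}:${r.start}-${r.end}"

  private def decodeRegion(s: String): Region = {
    // Split on the last ':' so contig names that contain ':' still parse.
    val colon = s.lastIndexOf(':')
    val Array(start, end) = s.substring(colon + 1).split("-")
    Region(s.substring(0, colon), start.toLong, end.toLong)
  }

  // One (first, last) pair per partition: halves joined by '|', pairs by ';'.
  def encode(pm: Seq[(Region, Region)]): String =
    pm.map { case (a, b) => encodeRegion(a) + "|" + encodeRegion(b) }.mkString(";")

  def decode(s: String): Seq[(Region, Region)] =
    if (s.isEmpty) Seq.empty
    else s.split(";").toSeq.map { pair =>
      val Array(a, b) = pair.split('|')
      (decodeRegion(a), decodeRegion(b))
    }
}
```

Even this small codec breaks on contig names containing ";" or "|", which is the sort of fragility a schema-backed Avro record would sidestep.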

heuermh · 2016-12-28T15:03:42Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+      VariantRDD(rdd, sd, headers, maybePartitionMapRdd = Some(sc.parallelize(pMap, pMap.length)))
+      // if we have no information about partition map we assume unsorted
+    } else {
+      // default to isSorted = false


isSorted → sorted here and everywhere else

heuermh · 2016-12-28T15:12:05Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+   * @return A SortedGenomicRDDMixIn that contains the sorted and partitioned
+   *         RDD
+   */
+  def repartitionAndSortByGenomicCoordinate(partitions: Int = rdd.partitions.length)(implicit c: ClassTag[T]): U = {


we don't use the word coordinate elsewhere, how about repartitionAndSortByReferenceRegion?

Or just repartitionAndSort to be consistent with the sort/sortLexicographically methods on GenomicRDD.

heuermh · 2016-12-28T15:12:42Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+    replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd))
+  }
+
+  private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner {


Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner?

heuermh · 2016-12-28T15:14:16Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/contig/NucleotideContigFragmentRDD.scala

@@ -70,7 +70,10 @@ private[rdd] object NucleotideContigFragmentRDD extends Serializable {
 */
 case class NucleotideContigFragmentRDD(
    rdd: RDD[NucleotideContigFragment],
-    sequences: SequenceDictionary) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile {
+    sequences: SequenceDictionary,
+    maybePartitionMapRdd: Option[RDD[(ReferenceRegion, ReferenceRegion)]] = None) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile {


maybePartitionMapRdd → partitionMap or partitions

+1. I'd prefer partitionMap over partitions as partitions sounds more harmonious with Spark's abstract idea of what a Partition is, while we are abusing that notion.

If partitionMap remains an RDD (which you've argued against above), then it shouldn't really be called Map. :)

heuermh · 2016-12-28T15:15:02Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala

@@ -145,7 +145,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
   *   file was saved.
   */
  private[rdd] def maybeSaveBam(args: ADAMSaveAnyArgs,
-                                isSorted: Boolean = false): Boolean = {
+                                isSorted: Boolean = isSorted): Boolean = {


isSorted → sorted, as elsewhere

Where do we have sorted? I've generally been preferring isSorted.

We're not writing Java Beans. if (sorted) reads much better to me.

You're right that sorted is the prevailing usage, BTW:

adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep isSorted {} \; | wc
      23     158    1295
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep sorted {} \; | wc
      48     463    3178

This disparity grows if you include test sources:

adam fnothaft$ find adam-*/src -name "*.scala" -exec grep isSorted {} \; | wc
      30     179    1459
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep sorted {} \; | wc
     135     839    7236

Created #1341 to fix this later

Leaving a note here to say that we decided to use isSorted because of the potential collision with sorted in scala.collection.

Our isSorted is not 100% the same as the SAM/BAM definition of coordinate sorted. We'd need to filter out any replicated records.
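
The filtering step mentioned above could be sketched as follows, under simplifying assumptions: a single contig, half-open [start, end) ranges, and partitions whose bounds are disjoint. The types Record and Bounds and the function dedupe are hypothetical and not taken from this PR. The idea is to keep a replicated record only in the partition that contains its start position.

```scala
object DropReplicatedRecords {
  // Hypothetical simplified types: one contig, half-open [start, end) ranges.
  final case class Record(name: String, start: Long, end: Long)
  final case class Bounds(start: Long, end: Long)

  // Each element pairs a partition's bounds with its records, which may
  // include copies of records that spill over from a neighbouring partition.
  def dedupe(partitions: Seq[(Bounds, Seq[Record])]): Seq[Record] =
    partitions.flatMap {
      case (b, records) =>
        // A record is "owned" by the partition whose range holds its start.
        records.filter(r => r.start >= b.start && r.start < b.end)
    }
}
```

With disjoint bounds every record has exactly one owning partition, so the flattened output contains each record once and preserves the per-partition order.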

heuermh · 2016-12-28T15:16:19Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/variant/VariantContextRDD.scala

@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
   * Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts
   * and saves to disk as VCF.
   *
-   * @param filePath The filepath to save to.
+   * @param args The arguments for saving the data


use full sentences for method parameter docs

heuermh · 2017-01-04T21:37:52Z

ok to set the milestone for this as 0.22.0?

fnothaft · 2017-01-04T21:39:54Z

+1 @heuermh, triaged to 0.22.0.

AmplabJenkins · 2017-01-11T12:13:46Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1729/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains e244fcfc34d1580be73e9c76e6dc72ceced60383 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10
Checking out Revision e244fcfc34d1580be73e9c76e6dc72ceced60383 (origin/pr/1324/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f e244fcfc34d1580be73e9c76e6dc72ceced60383
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins · 2017-01-11T12:22:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1730/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 8f803afa8e862dff1626eb9002de0cb27175465a # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10
Checking out Revision 8f803afa8e862dff1626eb9002de0cb27175465a (origin/pr/1324/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 8f803afa8e862dff1626eb9002de0cb27175465a
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

devin-petersohn · 2017-01-11T17:19:30Z

@fnothaft @heuermh How do we handle cross strand joins in the shuffleRegionJoin code? From what I can tell, nothing special is done about it, but the bug I encountered involved the ReferenceRegion.overlap() skipping things that were not on the same strand.

fnothaft · 2017-01-11T17:26:10Z

Cross-strand joins are handled by adam/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala, line 363 in 1478739:

y._1.overlaps(currentLeftRegion) &&

> ReferenceRegion.overlap() skipping things that were not on the same strand.

Ah yes, things that are not on the same strand do not overlap. ;) This is intentional. IIRC, we clarified a lot of this in e98ee2d. I think the question here is less "what behavior is correct in ReferenceRegion.overlaps" and more "should stranded or unstranded reference regions be input into the region join"? Would you agree? CC @laserson who might have opinions as well.

laserson · 2017-01-11T18:56:46Z

Seems like we should support both stranded and un-stranded. Could this simply be passed through to the operation that determines whether there is an overlap?

fnothaft · 2017-01-11T19:02:27Z

> Seems like we should support both stranded and un-stranded. Could this simply be passed through to the operation that determines whether there is an overlap?

I think there are a couple of ways we could do it. The simplest would be to generate unstranded reference regions as the keys to the join (after e98ee2d, this is very easy). Perhaps we'd just pass this as a switch on the GenomicRDD methods?
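
The "unstranded keys" idea can be sketched as below. This is an illustration under assumptions, not ADAM code: Region is a simplified stand-in for ReferenceRegion, the Strand values and the joinKey helper with its ignoreStrand switch are hypothetical names, and overlaps is written to mirror the behavior described in this thread, where regions on different strands do not overlap.

```scala
object StrandedOverlap {
  sealed trait Strand
  case object Forward extends Strand
  case object Reverse extends Strand
  case object Independent extends Strand

  // Hypothetical simplified region; half-open [start, end) coordinates.
  final case class Region(name: String, start: Long, end: Long, strand: Strand = Independent) {
    // Strand-aware overlap: regions on different strands never overlap.
    def overlaps(other: Region): Boolean =
      strand == other.strand && name == other.name &&
        start < other.end && end > other.start

    // Drops strand information so the region can match either strand.
    def unstranded: Region = copy(strand = Independent)
  }

  // The switch: when ignoreStrand is set, the join key carries no strand.
  def joinKey(r: Region, ignoreStrand: Boolean): Region =
    if (ignoreStrand) r.unstranded else r
}
```

Because the switch only changes how keys are built, the overlap predicate itself stays untouched and the stranded behavior remains the default.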

AmplabJenkins · 2017-01-11T19:07:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1731/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 8ccd18912301e91573c5245dc5c09edcf11fc337 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10
Checking out Revision 8ccd18912301e91573c5245dc5c09edcf11fc337 (origin/pr/1324/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 8ccd18912301e91573c5245dc5c09edcf11fc337
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins · 2017-01-11T19:42:42Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1733/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 2b68245 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10
Checking out Revision 2b68245 (origin/pr/1324/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 2b682452ac6d6453870022697b535357b023771d
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

AmplabJenkins · 2017-01-16T23:07:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1736/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c9aac611a0d99b8f3e2e6266369d1fcf6467ef46First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

AmplabJenkins · 2017-01-17T17:27:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1737/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b7204f839e6a69208114c630cf6cd025c13a6096 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision b7204f839e6a69208114c630cf6cd025c13a6096 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b7204f839e6a69208114c630cf6cd025c13a6096First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

devin-petersohn · 2017-01-17T22:21:37Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/variant/GenotypeRDDSuite.scala


-    val c = jRdd0.rdd.collect
+    val c = jRdd.rdd.collect


This seemed to be incorrect previously. Let me know if I should revert.

devin-petersohn · 2017-01-17T22:23:45Z

Pinging @fnothaft @heuermh for review. Please ignore the whitespace issues for now; I will push a fix for them shortly.

AmplabJenkins · 2017-01-17T22:24:12Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1738/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ec88406f303d32bcdaefa004bfd2c31ad748d080 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ec88406f303d32bcdaefa004bfd2c31ad748d080 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ec88406f303d32bcdaefa004bfd2c31ad748d080First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

AmplabJenkins · 2017-01-17T23:42:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1739/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 46af637fa97d2727a0077052d50d55270b18b707 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 46af637fa97d2727a0077052d50d55270b18b707 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 46af637fa97d2727a0077052d50d55270b18b707First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

AmplabJenkins · 2017-01-18T16:47:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1740/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ab4cefe # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ab4cefe (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ab4cefeabf657914e554f7acd41e735144cfd5b2First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

AmplabJenkins · 2017-05-26T03:55:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2066/

fnothaft · 2017-05-26T04:42:32Z

Thanks @devin-petersohn! I'm getting on a plane right now but should have time tomorrow AM to review.

fnothaft · 2017-05-26T04:46:44Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMInputFormat.scala

+ *       we are only sorting a number of elements equal to the number of
+ *       partitions written.
+ */
+private[rdd] class ADAMInputFormat[T] extends ParquetInputFormat[T] {


Can we rename ADAMInputFormat to ADAMParquetInputFormat, à la ADAMBAMInputFormat, ADAMVCFInputFormat, etc.?

fnothaft · 2017-05-26T04:59:59Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+      Some(partitionMapBuilder.toArray)
+    } catch {
+      case e: FileNotFoundException => None
+      // TODO: Log Exception


TODO ;)

Should we log this or rethrow?

I think rethrowing is correct here.
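
For reference, a minimal sketch of the two options discussed above (log and return None, or rethrow with context). It is not the ADAMContext code: readMetadata, loadOrNone, loadOrRethrow, and the in-memory Map standing in for the file system are all hypothetical names introduced for illustration.

```scala
import java.io.FileNotFoundException

object PartitionMapLoading {
  // Hypothetical stand-in for reading the metadata file. A missing file is
  // modelled by throwing FileNotFoundException, as a real read would.
  def readMetadata(files: Map[String, String], filename: String): String =
    files.getOrElse(filename, throw new FileNotFoundException(filename))

  // Option 1: log the problem and report "no partition map" to the caller.
  def loadOrNone(files: Map[String, String], filename: String): Option[String] =
    try {
      Some(readMetadata(files, filename))
    } catch {
      case e: FileNotFoundException =>
        Console.err.println(s"No partition map found at $filename: ${e.getMessage}")
        None
    }

  // Option 2: rethrow, adding context about what was being loaded.
  def loadOrRethrow(files: Map[String, String], filename: String): String =
    try {
      readMetadata(files, filename)
    } catch {
      case e: FileNotFoundException =>
        val wrapped = new FileNotFoundException(
          s"Could not read partition map metadata from $filename")
        wrapped.initCause(e)
        throw wrapped
    }
}
```

The first variant treats a missing file as "the data was not written sorted"; the second surfaces it as an error, which is the behaviour preferred at the end of this thread.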

fnothaft · 2017-05-26T05:01:11Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+   * @param filename the filename for the metadata
+   * @return a partition map if the data was written sorted, or an empty Seq if unsorted
+   */
+  private[rdd] def extractPartitionMap(


IMO, we should move this into an avro schema or something more structured, instead of parsing JSON. That said, let's not do it here and now. Can you open a ticket to refactor this into avroland post 0.23.0?

fnothaft · 2017-05-26T10:39:34Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

   * Load a path name in Parquet + Avro format into an AlignmentRecordRDD.
   *
   * @note The sequence dictionary is read from an Avro file stored at
   *   pathName/_seqdict.avro and the record group dictionary is read from an
   *   Avro file stored at pathName/_rgdict.avro. These files are pure Avro,
   *   not Parquet + Avro.
-   *


Keep space.

fnothaft · 2017-05-26T10:40:19Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ManualRegionPartitioner.scala

+    key match {
+      case (_, f2: Int) => f2
+      case _ => {
+        throw new Exception("Unable to partition without destination assignment")


We should include the offending key in the exception message. Otherwise, the exception is impossible to debug if thrown.
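
A minimal, Spark-free sketch of what including the key could look like. The object name and the choice of IllegalArgumentException are assumptions made for illustration; the real ManualRegionPartitioner extends Spark's Partitioner, which is omitted here so the snippet runs on its own.

```scala
object ManualPartitioningSketch {
  // Stand-in for a partitioner's getPartition(key: Any): Int.
  // The key is expected to be a tuple whose second element is the
  // destination partition ID.
  def getPartition(key: Any): Int = key match {
    case (_, destination: Int) => destination
    case other =>
      // Interpolating the key makes the failure traceable to the bad record.
      throw new IllegalArgumentException(
        s"Unable to partition key $other without a destination assignment; " +
          "expected a (key, Int) tuple.")
  }
}
```

With this shape, a call such as ManualPartitioningSketch.getPartition("chr1") fails with a message that names the key "chr1".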

fnothaft · 2017-05-26T11:03:11Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+    }).filter(f => f._1.nonEmpty).map(f => (f._1.get, f._2))
+      .sortBy(elem => elem._1, ascending = true, numPartitions = partitions)
+
+    partitionedRDD.cache()


I thought we were going to pull the partition mapping out of the sort code, so that people could sort without computing the partition map? Also, the storage level should be a parameter rather than fixed by cache().

fnothaft · 2017-05-26T11:04:21Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+      "Cannot copartition with an unsorted rdd!")
+
+    val destinationPartitionMap = rddToCoPartitionWith.optPartitionMap.get
+    //number of partitions we will have after repartition


Nits for the code in this function: there should be whitespace before a comment, and a space after //.

fnothaft · 2017-05-26T11:06:57Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+      // the zipWithIndex gives us the destination partition ID
+      destinationPartitionMap.flatten.zipWithIndex.map(g => {
+        val (firstRegion, secondRegion, index) = (g._1._1, g._1._2, g._2)
+        // in the case where we span multiple referenceNames


Can you add a bit more documentation on this codepath? E.g., what is necessary to represent a partition, why is the first region sometimes enough, when isn't it, etc.
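
To make the question concrete, here is a simplified sketch of one way a partition map could be represented and queried. It is an assumption-laden illustration, not the code under review: Region is a hypothetical stand-in for ReferenceRegion, each non-empty partition is described by its (first, last) regions in sort order, and a query is routed by its start position only, ignoring overlap with long regions.

```scala
object PartitionMapSketch {
  // Hypothetical, simplified stand-in for a reference region.
  case class Region(referenceName: String, start: Long, end: Long)

  // One entry per partition: None if the partition is empty, otherwise the
  // first and last regions of the (sorted) partition.
  type PartitionMap = Seq[Option[(Region, Region)]]

  def buildPartitionMap(sortedPartitions: Seq[Seq[Region]]): PartitionMap =
    sortedPartitions.map(p => p.headOption.map(first => (first, p.last)))

  // Positions are compared as (referenceName, start) pairs, so bounds that
  // span more than one reference name are handled by the same comparison.
  private val ord = implicitly[Ordering[(String, Long)]]

  // Destination IDs come from zipWithIndex after flatten, so empty
  // partitions are dropped before indices are assigned.
  def destinations(map: PartitionMap, query: Region): Seq[Int] = {
    val pos = (query.referenceName, query.start)
    map.flatten.zipWithIndex.collect {
      case ((first, last), index)
        if ord.lteq((first.referenceName, first.start), pos) &&
          ord.lteq(pos, (last.referenceName, last.start)) => index
    }
  }
}
```

In this sketch both bounds are always consulted; documenting when a single bound would suffice in the real code is exactly what the comment above asks for.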

fnothaft · 2017-05-26T11:07:54Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+          // includes any extremely long regions. we include the firstRegion for
+          // the case that the first region is extremely long
+          (iter ++ Iterator(firstRegion)).maxBy(f => (f._1.referenceName, f._1.end, f._1.start))
+          // only one record on this partition, so this is the extent of the bounds


Nit: this comment should be in the else clause.
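
For readers following the diff, a small sketch of what the maxBy expression computes. Region is a hypothetical stand-in for the region type, and the elements are plain regions rather than the tuples used in the actual code.

```scala
object PartitionUpperBound {
  // Hypothetical, simplified stand-in for a reference region.
  case class Region(referenceName: String, start: Long, end: Long)

  // Picks the region that sorts last by (referenceName, end, start).
  // The first region is appended so that a very long first region can still
  // win over later, shorter regions on the same reference.
  def upperBound(firstRegion: Region, rest: Iterator[Region]): Region =
    (rest ++ Iterator(firstRegion)).maxBy(r => (r.referenceName, r.end, r.start))
}
```

If rest is empty, the first region is returned unchanged, which corresponds to the single-record case that the misplaced comment describes.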

fnothaft · 2017-05-26T11:09:41Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala

-        (lo to hi).map(i => ((region, i), y))
-      })
+  def compute(): RDD[(RT, RU)] = {
+    leftRdd.zipPartitions(rightRdd, preservesPartitioning = true)(makeIterator)


Technically, this doesn't "preserve partitioning" as defined by Spark, right?

In this case, I doubt that it matters. It just lets Spark know that the partitioner it used (in this case ManualRegionPartitioner) to partition the data is still valid after this operation.

That's my point: isn't the ManualRegionPartitioner invalid after the zipPartitions, since the "key" now has type RT and the partitioner expects type (_, Int)?

+1 I will remove it.

heuermh · 2017-05-26T16:02:13Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala

+   *
+   * @param rdd The underlying AlignmentRecord RDD.
+   * @return A new AlignmentRecordRDD.
+   */
  def unaligned(rdd: RDD[AlignmentRecord]): AlignmentRecordRDD = {
    AlignmentRecordRDD(rdd,
      SequenceDictionary.empty,


This could call AlignmentRecordRDD(rdd, sequences, recordGroupDictionary, None) directly

Or even better, it could call the constructor.

That's what I meant. Or is there a different syntax?

See next 2 lines for solution
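
A minimal sketch of the distinction being discussed, using hypothetical names (Reads, records, sequences, optPartitionMap) rather than the real AlignmentRecordRDD signature: a companion apply fills in a default, while a factory method can call the constructor directly and state every argument.

```scala
object ApplySketch {
  // Hypothetical stand-in for a dataset wrapper that optionally carries
  // a partition map.
  class Reads(
    val records: Seq[String],
    val sequences: Seq[String],
    val optPartitionMap: Option[Seq[Int]])

  object Reads {
    // Convenience apply for data with no known partition map.
    def apply(records: Seq[String], sequences: Seq[String]): Reads =
      new Reads(records, sequences, None)

    // Calls the constructor directly instead of routing through apply,
    // so the missing partition map is explicit at the call site.
    def unaligned(records: Seq[String]): Reads =
      new Reads(records, Seq.empty, None)
  }
}
```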

heuermh · 2017-05-26T16:05:09Z

All the apply refactoring looks good to me, thanks!

devin-petersohn · 2017-05-26T18:17:36Z

adam-core/src/test/scala/org/bdgenomics/adam/rdd/variant/VariantRDDSuite.scala

@@ -196,7 +196,7 @@ class VariantRDDSuite extends ADAMFunSuite {
    // we can't guarantee that we get exactly the number of partitions requested,
    // we get close though
    assert(jRdd.rdd.partitions.length === 1)
-    assert(jRdd0.rdd.partitions.length === 5)
+    assert(jRdd0.rdd.partitions.length === 4)

    val c = jRdd0.rdd.collect


This is a bug. I will fix it.

coveralls · 2017-05-26T18:21:50Z

Coverage decreased (-0.07%) to 81.967% when pulling 7805a7f on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

coveralls · 2017-05-26T18:21:50Z

Coverage increased (+0.4%) to 82.418% when pulling 7805a7f on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

AmplabJenkins · 2017-05-26T18:25:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2067/

coveralls · 2017-05-26T18:46:14Z

Coverage decreased (-0.08%) to 81.96% when pulling f15fe55 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

coveralls · 2017-05-26T18:46:15Z

Coverage increased (+0.6%) to 82.599% when pulling f15fe55 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

AmplabJenkins · 2017-05-26T18:49:37Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2069/

devin-petersohn · 2017-05-30T22:51:37Z

@fnothaft, @heuermh, I think we're good to go for another review. At your earliest convenience of course.

heuermh

Boom!

coveralls · 2017-05-31T18:11:39Z

Coverage increased (+0.6%) to 82.599% when pulling 536b936 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

AmplabJenkins · 2017-05-31T18:14:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2076/

fnothaft · 2017-05-31T19:12:17Z

@devin-petersohn you've got a conflict in adam-core/src/main/scala/org/bdgenomics/adam/rdd/variant/VariantContextRDD.scala. Can you rebase and clear the conflict?

Addressing reviewer comments
Performance improvements related to decreasing compute
Major refactor to clean up ShuffleRegionJoin
Fixing serialization issue
Cleaning up tests and code
More cleanup
Addressing reviewer comments
Addressing reviewer comments, cleaning up after intellij
Testing that data stays sorted on Jenkins
Still debugging the sorted persistance
Testing that we can persist sort
Checking output to hopefully see what is happening
Testing more related to sort persist
Adding ADAMInputFormat class to ensure ordering is maintained
Adding docs and cleanup
Cleaning up code a bit
Cleaning up after IntelliJ
Addressing reviewer comments
Addressing reviewer comments
Addressing reviewer comments
Clean up docs, cutting at 80 characters.
Addressing reviewer comments
Addressing reviewer comments. sorted to isSorted
Fixing some spacing issues
Addressing reviwer comments

devin-petersohn · 2017-05-31T20:37:29Z

@fnothaft Rebased, thanks!

coveralls · 2017-05-31T20:46:37Z

Coverage decreased (-0.2%) to 82.569% when pulling 7a9503c on devin-petersohn:partitioner into b7762c2 on bigdatagenomics:master.

AmplabJenkins · 2017-05-31T20:50:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2077/

fnothaft · 2017-06-05T17:51:42Z

Merged! Thanks @devin-petersohn!

devin-petersohn commented Dec 23, 2016

fnothaft requested changes Dec 23, 2016

devin-petersohn commented Dec 28, 2016

fnothaft requested changes Dec 28, 2016

heuermh requested changes Dec 28, 2016

heuermh mentioned this pull request Jan 4, 2017

Refactor isSorted boolean parameters to sorted #1341

Closed

fnothaft added this to the 0.22.0 milestone Jan 4, 2017

devin-petersohn commented Jan 17, 2017

fnothaft requested changes May 26, 2017

heuermh requested changes May 26, 2017

devin-petersohn commented May 26, 2017

heuermh approved these changes May 31, 2017

devin-petersohn added 4 commits May 31, 2017 13:35

Addressing reviewer comments

1dae006

Resolving bugs in unit tests

e77afb4

Fixing a spacing issue

7a9503c

devin-petersohn force-pushed the partitioner branch from 536b936 to 7a9503c Compare May 31, 2017 20:36

fnothaft approved these changes Jun 5, 2017

fnothaft merged commit e5ae270 into bigdatagenomics:master Jun 5, 2017

devin-petersohn mentioned this pull request Jun 5, 2017

Maintaining sorted/partitioned knowledge #1216

Closed

devin-petersohn mentioned this pull request Jul 7, 2017

[ADAM-1018] Add support for Spark SQL Datasets. #1391

Merged

4 tasks
