Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient Joins and (re)Partitioning #1324

Merged
merged 4 commits into from
Jun 5, 2017

Conversation

devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Dec 23, 2016

Ready for review.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@fnothaft
Copy link
Member

Jenkins, add to whitelist.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1700/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 1bbc8ab # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 1bbc8ab (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 1bbc8abae508530dab55ae88be11935a74b594e0First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

val i = z.leftOuterShuffleRegionJoin(x).rdd.collect
assert(h.length == i.length)

val t = sc.loadParquetAlignments("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/sortedAlignments.parquet.txt")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this path.

sparkTest("testing partitioner") {
time {
//val x = sc.loadBam("/data/recompute/alignments/NA12878.bam.aln.bam")
val x = sc.loadBam("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/bqsr1.sam")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this path.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1701/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 58cfe15 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 58cfe15 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 58cfe152a16c84931ba087d38ffa53a9397b5bbfFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM! Just made a first review pass. For more detailed review, I need more inline docs.

In addition to the changes I suggested, please remove the .parquet test resource files from this PR. If needed for tests, they should be generated inside the test. We try to avoid checking in binary files whenever possible.

Linking to #1216.

@@ -62,8 +62,11 @@ import org.bdgenomics.utils.io.LocalFileByteAccess
import org.bdgenomics.utils.misc.{ HadoopUtil, Logging }
import org.seqdoop.hadoop_bam._
import org.seqdoop.hadoop_bam.util._

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove space.

@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The (possibly globbed) filepath to load a VCF from.
* @return Returns a tuple of metadata from the VCF header, including the
* sequence dictionary and a list of the samples contained in the VCF.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd prefer to keep these spaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there are a lot of these whitespace changes and there shouldn't be any of them. ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will refix these. They were fixed at some point but my IDE is not cooperating :(

* @param filename the filename for the metadata
* @return a partition map if the data was written sorted, or an empty Seq if unsorted
*/
def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be (package) private.

// this unfortunately seems to be the only way to do this
// avro does not seem to support getting metadata fields out once you have the from the string
val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]]
//we want this for the use case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand this comment a bit? Actually, what might be preferable, is to have a longer inline comment that documents the format of what we are parsing. I think that would make this section of the code much easier to review.

// parsing the json from the metadata header
// this unfortunately seems to be the only way to do this
// avro does not seem to support getting metadata fields out once you have the from the string
val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the cast here?

@@ -308,7 +308,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
filePath: String,
asType: Option[SAMFormat] = None,
asSingleFile: Boolean = false,
isSorted: Boolean = false,
isSorted: Boolean = SortedTrait.isSorted,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto RE: eliminating parameter.

@@ -583,7 +583,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
*/
def realignIndels(
consensusModel: ConsensusGenerator = new ConsensusGeneratorFromReads,
isSorted: Boolean = false,
isSorted: Boolean = SortedTrait.isSorted,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto RE: eliminating parameter.

@@ -20,7 +20,7 @@ package org.bdgenomics.adam.models
import htsjdk.samtools.SAMReadGroupRecord
import org.scalatest.FunSuite

class RecordGroupDictionarySuite extends FunSuite {
class moRecordGroupDictionarySuite extends FunSuite {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert mo

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.utils.misc.SparkFunSuite

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No author comments.

@@ -0,0 +1,72 @@
package org.bdgenomics.adam.rdd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add license header.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running ./scripts/format-source should add the license header.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1709/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 86150cf # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 86150cf (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 86150cfa21599654d53606727c562e3ae5c93ec3First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
* Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts
* and saves to disk as VCF.
*
* @param filePath The filepath to save to.
* @param args The arguments for saving the data
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated docs in a few places where they were not correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use full sentences for method parameter docs

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made an additional review pass. Let's chat about the boundary case where we have a record that is duplicated across partition boundaries.

@@ -296,8 +298,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The filepath to load a single Avro file of sequence
* dictionary info from.
* @return Returns the SequenceDictionary representing said reference build.
*
* @see loadAvroSequences
* @see loadAvroSequences
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaced one space too far.

@@ -330,8 +331,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The filepath to load a single Avro file containing read
* group metadata.
* @return Returns a RecordGroupDictionary.
*
* @see loadAvroReadGroupMetadata
* @see loadAvroReadGroupMetadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaced one space too far.

*
* @throws FileNotFoundException if the path does not match any files.
* @see getFsAndFiles
* @throws FileNotFoundException if the path does not match any files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaced one space too far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, lost whitespace.

* @see getFiles
*
* @throws FileNotFoundException if the path does not match any files.
* @see getFiles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaced one space too far.
Also, lost whitespace.

*
* @throws FileNotFoundException if the path does not match any files.
* @see getFiles
* @throws FileNotFoundException if the path does not match any files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaced one space too far.
Also, lost whitespace.

replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd))
}

private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to GenomicPartitioners.scala

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenomicPosition is used in GenomicPartitioners.scala.

if (isSorted) {
sampleSchema.addProp("sorted", "true".asInstanceOf[Any])
sampleSchema.addProp("partitionMap", partitionMap.mkString(",").asInstanceOf[Any])
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add whitespace after.

@@ -251,7 +251,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
*
* @return Returns a SAM/BAM formatted RDD of reads, as well as the file header.
*/
def convertToSam(isSorted: Boolean = false): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time {
def convertToSam(isSorted: Boolean = isSorted): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's going to be more logic that is needed here WRT dealing with records that are duplicated by the flanking process.

@@ -0,0 +1,72 @@
package org.bdgenomics.adam.rdd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running ./scripts/format-source should add the license header.

*/
class SortedGenomicRDDSuite extends SparkFunSuite {

def time[R](block: => R): R = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping a note here to remove this block before we merge.

@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The (possibly globbed) filepath to load a VCF from.
* @return Returns a tuple of metadata from the VCF header, including the
* sequence dictionary and a list of the samples contained in the VCF.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)

* @param filename the filename for the metadata
* @return a partition map if the data was written sorted, or an empty Seq if unsorted
*/
private[rdd] def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

determineIsSortedAndExtractPartitionMapextractPartitionMap

val maybePartitionMap = metaDataMap.get("partitionMap")
// we didn't write a partition map, which means this was not sorted at write
// or at least we didn't have information that it was sorted
if(maybePartitionMap.isEmpty) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace if(if (

// Seq with commas separating each tuple of (ReferenceRegion, ReferenceRegion)
// The first split is breaking up the tuples. Each tuple starts with a
// "(" then a ReferenceRegion, so we are simply pulling out the tuples
// by using the start of each tuple as the indicator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding ReferenceRegion to the bdg-formats schemas and using Avro to write out the partition map? I don't see why we should be writing some stuff out as Parquet, some as Avro, some as JSON, and some as Strings or byte-serialized JVM objects.

VariantRDD(rdd, sd, headers, maybePartitionMapRdd = Some(sc.parallelize(pMap, pMap.length)))
// if we have no information about partition map we assume unsorted
} else {
// default to isSorted = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSortedsorted here and everywhere else

* @return A SortedGenomicRDDMixIn that contains the sorted and partitioned
* RDD
*/
def repartitionAndSortByGenomicCoordinate(partitions: Int = rdd.partitions.length)(implicit c: ClassTag[T]): U = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't use the word coordinate elsewhere, how about repartitionAndSortByReferenceRegion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just repartitionAndSort to be consistent with the sort/sortLexicographically methods on GenomicRDD.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd))
}

private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner?

@@ -70,7 +70,10 @@ private[rdd] object NucleotideContigFragmentRDD extends Serializable {
*/
case class NucleotideContigFragmentRDD(
rdd: RDD[NucleotideContigFragment],
sequences: SequenceDictionary) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile {
sequences: SequenceDictionary,
maybePartitionMapRdd: Option[RDD[(ReferenceRegion, ReferenceRegion)]] = None) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybePartitionMapRddpartitionMap or partitions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I'd prefer partitionMap over partitions as partitions sounds more harmonious with Spark's abstract idea of what a Partition is, while we are abusing that notion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If partitionMap remains an RDD (which you've argued against above), then it shouldn't really be called Map. :)

@@ -145,7 +145,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
* file was saved.
*/
private[rdd] def maybeSaveBam(args: ADAMSaveAnyArgs,
isSorted: Boolean = false): Boolean = {
isSorted: Boolean = isSorted): Boolean = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSortedsorted, as elsewhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we have sorted? I've generally been preferring isSorted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not writing Java Beans. if (sorted) reads much better to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that sorted is the prevailing usage, BTW:

adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep isSorted {} \; | wc
      23     158    1295
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep sorted {} \; | wc
      48     463    3178

This disparity grows if you include test sources:

adam fnothaft$ find adam-*/src -name "*.scala" -exec grep isSorted {} \; | wc
      30     179    1459
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep sorted {} \; | wc
     135     839    7236

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that sorted is the prevailing usage, BTW:

adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep isSorted {} \; | wc
      23     158    1295
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep sorted {} \; | wc
      48     463    3178

This disparity grows if you include test sources:

adam fnothaft$ find adam-*/src -name "*.scala" -exec grep isSorted {} \; | wc
      30     179    1459
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep sorted {} \; | wc
     135     839    7236

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #1341 to fix this later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a note here to say that we decided to use isSorted because of the potential collision with sorted in scala.collection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our isSorted is not 100% the same as the SAM/BAM definition of coordinate sorted. We'd need to filter out any replicated records.

@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
* Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts
* and saves to disk as VCF.
*
* @param filePath The filepath to save to.
* @param args The arguments for saving the data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use full sentences for method parameter docs

@heuermh
Copy link
Member

heuermh commented Jan 4, 2017

ok to set the milestone for this as 0.22.0?

@fnothaft fnothaft added this to the 0.22.0 milestone Jan 4, 2017
@fnothaft
Copy link
Member

fnothaft commented Jan 4, 2017

+1 @heuermh, triaged to 0.22.0.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1729/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains e244fcfc34d1580be73e9c76e6dc72ceced60383 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision e244fcfc34d1580be73e9c76e6dc72ceced60383 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f e244fcfc34d1580be73e9c76e6dc72ceced60383First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1730/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 8f803afa8e862dff1626eb9002de0cb27175465a # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 8f803afa8e862dff1626eb9002de0cb27175465a (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 8f803afa8e862dff1626eb9002de0cb27175465aFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@devin-petersohn
Copy link
Member Author

@fnothaft @heuermh How do we handle cross strand joins in the shuffleRegionJoin code? From what I can tell, nothing special is done about it, but the bug I encountered involved the ReferenceRegion.overlap() skipping things that were not on the same strand.

@fnothaft
Copy link
Member

Cross-strand joins are handled by

.

ReferenceRegion.overlap() skipping things that were not on the same strand.

Ah yes, things that are not on the same strand do not overlap. ;) This is intentional. IIRC, we clarified a lot of this in e98ee2d. I think the question here is less "what behavior is correct in ReferenceRegion.overlaps" and more "should stranded or unstranded reference regions be input into the region join"? Would you agree? CC @laserson who might have opinions as well.

@laserson
Copy link
Contributor

Seems like we should support both stranded and un-stranded. Could this simply be passed through to the operation that determines whether there is an overlap?

@fnothaft
Copy link
Member

Seems like we should support both stranded and un-stranded. Could this simply be passed through to the operation that determines whether there is an overlap?

I think there's a couple of ways we could do it. I think the simplest would be to generate unstranded reference regions as the keys to the join (after e98ee2d, this is v. easy). Perhaps we'd just pass this as a switch on the GenomicRDD methods?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1731/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 8ccd18912301e91573c5245dc5c09edcf11fc337 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 8ccd18912301e91573c5245dc5c09edcf11fc337 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 8ccd18912301e91573c5245dc5c09edcf11fc337First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1733/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 2b68245 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 2b68245 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2b682452ac6d6453870022697b535357b023771dFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1736/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c9aac611a0d99b8f3e2e6266369d1fcf6467ef46First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1737/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b7204f839e6a69208114c630cf6cd025c13a6096 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision b7204f839e6a69208114c630cf6cd025c13a6096 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b7204f839e6a69208114c630cf6cd025c13a6096First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


val c = jRdd0.rdd.collect
val c = jRdd.rdd.collect
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to be incorrect previously. Let me know if I should revert.

@devin-petersohn
Copy link
Member Author

Pinging @fnothaft @heuermh for review. Please ignore the whitespace issues for now, I will push something shortly that fixes that.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1738/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ec88406f303d32bcdaefa004bfd2c31ad748d080 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ec88406f303d32bcdaefa004bfd2c31ad748d080 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ec88406f303d32bcdaefa004bfd2c31ad748d080First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1739/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 46af637fa97d2727a0077052d50d55270b18b707 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 46af637fa97d2727a0077052d50d55270b18b707 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 46af637fa97d2727a0077052d50d55270b18b707First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1740/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ab4cefe # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ab4cefe (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ab4cefeabf657914e554f7acd41e735144cfd5b2First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2066/
Test PASSed.

@fnothaft
Copy link
Member

Thanks @devin-petersohn! I'm getting on a plane right now but should have time tomorrow AM to review.

* we are only sorting a number of elements equal to the number of
* partitions written.
*/
private[rdd] class ADAMInputFormat[T] extends ParquetInputFormat[T] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename ADAMInputFormat to ADAMParquetInputFormat, à la ADAMBAMInputFormat, ADAMVCFInputFormat, etc?

Some(partitionMapBuilder.toArray)
} catch {
case e: FileNotFoundException => None
// TODO: Log Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO ;)

Should we log this or rethrow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rethrowing is correct here.

* @param filename the filename for the metadata
* @return a partition map if the data was written sorted, or an empty Seq if unsorted
*/
private[rdd] def extractPartitionMap(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we should move this into an avro schema or something more structured, instead of parsing JSON. That said, let's not do it here and now. Can you open a ticket to refactor this into avroland post 0.23.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

* Load a path name in Parquet + Avro format into an AlignmentRecordRDD.
*
* @note The sequence dictionary is read from an Avro file stored at
* pathName/_seqdict.avro and the record group dictionary is read from an
* Avro file stored at pathName/_rgdict.avro. These files are pure Avro,
* not Parquet + Avro.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep space.

key match {
case (_, f2: Int) => f2
case _ => {
throw new Exception("Unable to partition without destination assignment")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the offending key in the exception message. Otherwise, the exception is impossible to debug if thrown.

}).filter(f => f._1.nonEmpty).map(f => (f._1.get, f._2))
.sortBy(elem => elem._1, ascending = true, numPartitions = partitions)

partitionedRDD.cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were going to pull the partition mapping out of the sort code, so that people could sort without computing the partition map? Also, storage level should be parametrizable.

"Cannot copartition with an unsorted rdd!")

val destinationPartitionMap = rddToCoPartitionWith.optPartitionMap.get
//number of partitions we will have after repartition
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits for the code in this function:

  • there should be whitespace before a comment
  • space after //

// the zipWithIndex gives us the destination partition ID
destinationPartitionMap.flatten.zipWithIndex.map(g => {
val (firstRegion, secondRegion, index) = (g._1._1, g._1._2, g._2)
// in the case where we span multiple referenceNames
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a bit more documentation on this codepath? E.g., what is necessary to represent a partition, why is the first region sometimes enough, when isn't it, etc.

// includes any extremely long regions. we include the firstRegion for
// the case that the first region is extremely long
(iter ++ Iterator(firstRegion)).maxBy(f => (f._1.referenceName, f._1.end, f._1.start))
// only one record on this partition, so this is the extent of the bounds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this comment should be in the else clause.

(lo to hi).map(i => ((region, i), y))
})
def compute(): RDD[(RT, RU)] = {
leftRdd.zipPartitions(rightRdd, preservesPartitioning = true)(makeIterator)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this doesn't "preserve partitioning" as defined by Spark, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I doubt that it matters. It just lets Spark know that the partitioner it used (in this case ManualRegionPartitioner) to partition the data is still valid after this operation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my point: isn't the ManualRegionPartitioner invalid after the zipPartitions, since the "key" now has type RT and the partitioner expects type (_, Int)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I will remove it.

*
* @param rdd The underlying AlignmentRecord RDD.
* @return A new AlignmentRecordRDD.
*/
def unaligned(rdd: RDD[AlignmentRecord]): AlignmentRecordRDD = {
AlignmentRecordRDD(rdd,
SequenceDictionary.empty,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could call AlignmentRecordRDD(rdd, sequences, recordGroupDictionary, None) directly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even better, it could call the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I meant. Or is there a different syntax?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See next 2 lines for solution

@heuermh
Copy link
Member

heuermh commented May 26, 2017

All the apply refactoring looks good to me, thanks!

@@ -196,7 +196,7 @@ class VariantRDDSuite extends ADAMFunSuite {
// we can't guarantee that we get exactly the number of partitions requested,
// we get close though
assert(jRdd.rdd.partitions.length === 1)
assert(jRdd0.rdd.partitions.length === 5)
assert(jRdd0.rdd.partitions.length === 4)

val c = jRdd0.rdd.collect
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bug. I will fix it.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.07%) to 81.967% when pulling 7805a7f on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@coveralls
Copy link

coveralls commented May 26, 2017

Coverage Status

Coverage increased (+0.4%) to 82.418% when pulling 7805a7f on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2067/
Test PASSed.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.08%) to 81.96% when pulling f15fe55 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@coveralls
Copy link

coveralls commented May 26, 2017

Coverage Status

Coverage increased (+0.6%) to 82.599% when pulling f15fe55 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2069/
Test PASSed.

@devin-petersohn
Copy link
Member Author

devin-petersohn commented May 30, 2017

@fnothaft, @heuermh, I think we're good to go for another review. At your earliest convenience of course.

Copy link
Member

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boom!

@coveralls
Copy link

coveralls commented May 31, 2017

Coverage Status

Coverage increased (+0.6%) to 82.599% when pulling 536b936 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2076/
Test PASSed.

@fnothaft
Copy link
Member

@devin-petersohn you've got a conflict in adam-core/src/main/scala/org/bdgenomics/adam/rdd/variant/VariantContextRDD.scala. Can you rebase and clear the conflict?

Addressing reviewer comments

Performance improvements related to decreasing compute

Major refactor to clean up ShuffleRegionJoin

Fixing serialization issue

Cleaning up tests and code

More cleanup

Addressing reviewer comments

Addressing reviewer comments, cleaning up after intellij

Testing that data stays sorted on Jenkins

Still debugging the sorted persistance

Testing that we can persist sort

Checking output to hopefully see what is happening

Testing more related to sort persist

Adding ADAMInputFormat class to ensure ordering is maintained

Adding docs and cleanup

Cleaning up code a bit

Cleaning up after IntelliJ

Addressing reviewer comments

Addressing reviewer comments

Addressing reviewer comments

Clean up docs, cutting at 80 characters.

Addressing reviewer comments

Addressing reviewer comments. sorted to isSorted

Fixing some spacing issues

Addressing reviwer comments
@devin-petersohn
Copy link
Member Author

@fnothaft Rebased, thanks!

@coveralls
Copy link

coveralls commented May 31, 2017

Coverage Status

Coverage decreased (-0.2%) to 82.569% when pulling 7a9503c on devin-petersohn:partitioner into b7762c2 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2077/
Test PASSed.

@fnothaft fnothaft merged commit e5ae270 into bigdatagenomics:master Jun 5, 2017
@fnothaft
Copy link
Member

fnothaft commented Jun 5, 2017

Merged! Thanks @devin-petersohn!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants