-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficient Joins and (re)Partitioning #1324
Conversation
Can one of the admins verify this patch? |
Jenkins, add to whitelist. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 1bbc8ab # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 1bbc8ab (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 1bbc8abae508530dab55ae88be11935a74b594e0First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
val i = z.leftOuterShuffleRegionJoin(x).rdd.collect | ||
assert(h.length == i.length) | ||
|
||
val t = sc.loadParquetAlignments("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/sortedAlignments.parquet.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix this path.
sparkTest("testing partitioner") { | ||
time { | ||
//val x = sc.loadBam("/data/recompute/alignments/NA12878.bam.aln.bam") | ||
val x = sc.loadBam("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/bqsr1.sam") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix this path.
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 58cfe15 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 58cfe15 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 58cfe152a16c84931ba087d38ffa53a9397b5bbfFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM! Just made a first review pass. For more detailed review, I need more inline docs.
In addition to the changes I suggested, please remove the .parquet
test resource files from this PR. If needed for tests, they should be generated inside the test. We try to avoid checking in binary files whenever possible.
Linking to #1216.
@@ -62,8 +62,11 @@ import org.bdgenomics.utils.io.LocalFileByteAccess | |||
import org.bdgenomics.utils.misc.{ HadoopUtil, Logging } | |||
import org.seqdoop.hadoop_bam._ | |||
import org.seqdoop.hadoop_bam.util._ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: remove space.
@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log | |||
* @param filePath The (possibly globbed) filepath to load a VCF from. | |||
* @return Returns a tuple of metadata from the VCF header, including the | |||
* sequence dictionary and a list of the samples contained in the VCF. | |||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'd prefer to keep these spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there are a lot of these whitespace changes and there shouldn't be any of them. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will refix these. They were fixed at some point but my IDE is not cooperating :(
* @param filename the filename for the metadata | ||
* @return a partition map if the data was written sorted, or an empty Seq if unsorted | ||
*/ | ||
def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be (package) private.
// this unfortunately seems to be the only way to do this | ||
// avro does not seem to support getting metadata fields out once you have the from the string | ||
val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]] | ||
//we want this for the use case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand this comment a bit? Actually, what might be preferable, is to have a longer inline comment that documents the format of what we are parsing. I think that would make this section of the code much easier to review.
// parsing the json from the metadata header | ||
// this unfortunately seems to be the only way to do this | ||
// avro does not seem to support getting metadata fields out once you have the from the string | ||
val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the cast here?
@@ -308,7 +308,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord, | |||
filePath: String, | |||
asType: Option[SAMFormat] = None, | |||
asSingleFile: Boolean = false, | |||
isSorted: Boolean = false, | |||
isSorted: Boolean = SortedTrait.isSorted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto RE: eliminating parameter.
@@ -583,7 +583,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord, | |||
*/ | |||
def realignIndels( | |||
consensusModel: ConsensusGenerator = new ConsensusGeneratorFromReads, | |||
isSorted: Boolean = false, | |||
isSorted: Boolean = SortedTrait.isSorted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto RE: eliminating parameter.
@@ -20,7 +20,7 @@ package org.bdgenomics.adam.models | |||
import htsjdk.samtools.SAMReadGroupRecord | |||
import org.scalatest.FunSuite | |||
|
|||
class RecordGroupDictionarySuite extends FunSuite { | |||
class moRecordGroupDictionarySuite extends FunSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert mo
import org.bdgenomics.adam.rdd.ADAMContext._ | ||
import org.bdgenomics.utils.misc.SparkFunSuite | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No author comments.
@@ -0,0 +1,72 @@ | |||
package org.bdgenomics.adam.rdd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add license header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running ./scripts/format-source
should add the license header.
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 86150cf # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 86150cf (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 86150cfa21599654d53606727c562e3ae5c93ec3First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext], | |||
* Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts | |||
* and saves to disk as VCF. | |||
* | |||
* @param filePath The filepath to save to. | |||
* @param args The arguments for saving the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated docs in a few places where they were not correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use full sentences for method parameter docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made an additional review pass. Let's chat about the boundary case where we have a record that is duplicated across partition boundaries.
@@ -296,8 +298,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log | |||
* @param filePath The filepath to load a single Avro file of sequence | |||
* dictionary info from. | |||
* @return Returns the SequenceDictionary representing said reference build. | |||
* | |||
* @see loadAvroSequences | |||
* @see loadAvroSequences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaced one space too far.
@@ -330,8 +331,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log | |||
* @param filePath The filepath to load a single Avro file containing read | |||
* group metadata. | |||
* @return Returns a RecordGroupDictionary. | |||
* | |||
* @see loadAvroReadGroupMetadata | |||
* @see loadAvroReadGroupMetadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaced one space too far.
* | ||
* @throws FileNotFoundException if the path does not match any files. | ||
* @see getFsAndFiles | ||
* @throws FileNotFoundException if the path does not match any files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaced one space too far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, lost whitespace.
* @see getFiles | ||
* | ||
* @throws FileNotFoundException if the path does not match any files. | ||
* @see getFiles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaced one space too far.
Also, lost whitespace.
* | ||
* @throws FileNotFoundException if the path does not match any files. | ||
* @see getFiles | ||
* @throws FileNotFoundException if the path does not match any files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaced one space too far.
Also, lost whitespace.
replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd)) | ||
} | ||
|
||
private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to GenomicPartitioners.scala
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenomicPosition
is used in GenomicPartitioners.scala
.
if (isSorted) { | ||
sampleSchema.addProp("sorted", "true".asInstanceOf[Any]) | ||
sampleSchema.addProp("partitionMap", partitionMap.mkString(",").asInstanceOf[Any]) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add whitespace after.
@@ -251,7 +251,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord, | |||
* | |||
* @return Returns a SAM/BAM formatted RDD of reads, as well as the file header. | |||
*/ | |||
def convertToSam(isSorted: Boolean = false): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time { | |||
def convertToSam(isSorted: Boolean = isSorted): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's going to be more logic that is needed here WRT dealing with records that are duplicated by the flanking process.
@@ -0,0 +1,72 @@ | |||
package org.bdgenomics.adam.rdd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running ./scripts/format-source
should add the license header.
*/ | ||
class SortedGenomicRDDSuite extends SparkFunSuite { | ||
|
||
def time[R](block: => R): R = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping a note here to remove this block before we merge.
@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log | |||
* @param filePath The (possibly globbed) filepath to load a VCF from. | |||
* @return Returns a tuple of metadata from the VCF header, including the | |||
* sequence dictionary and a list of the samples contained in the VCF. | |||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)
* @param filename the filename for the metadata | ||
* @return a partition map if the data was written sorted, or an empty Seq if unsorted | ||
*/ | ||
private[rdd] def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
determineIsSortedAndExtractPartitionMap
→ extractPartitionMap
val maybePartitionMap = metaDataMap.get("partitionMap") | ||
// we didn't write a partition map, which means this was not sorted at write | ||
// or at least we didn't have information that it was sorted | ||
if(maybePartitionMap.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace if(
→ if (
// Seq with commas separating each tuple of (ReferenceRegion, ReferenceRegion) | ||
// The first split is breaking up the tuples. Each tuple starts with a | ||
// "(" then a ReferenceRegion, so we are simply pulling out the tuples | ||
// by using the start of each tuple as the indicator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding ReferenceRegion
to the bdg-formats schemas and using Avro to write out the partition map? I don't see why we should be writing some stuff out as Parquet, some as Avro, some as JSON, and some as Strings or byte-serialized JVM objects.
VariantRDD(rdd, sd, headers, maybePartitionMapRdd = Some(sc.parallelize(pMap, pMap.length))) | ||
// if we have no information about partition map we assume unsorted | ||
} else { | ||
// default to isSorted = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isSorted
→ sorted
here and everywhere else
* @return A SortedGenomicRDDMixIn that contains the sorted and partitioned | ||
* RDD | ||
*/ | ||
def repartitionAndSortByGenomicCoordinate(partitions: Int = rdd.partitions.length)(implicit c: ClassTag[T]): U = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use the word coordinate elsewhere, how about repartitionAndSortByReferenceRegion
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just repartitionAndSort
to be consistent with the sort
/sortLexicographically
methods on GenomicRDD
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd)) | ||
} | ||
|
||
private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner
?
@@ -70,7 +70,10 @@ private[rdd] object NucleotideContigFragmentRDD extends Serializable { | |||
*/ | |||
case class NucleotideContigFragmentRDD( | |||
rdd: RDD[NucleotideContigFragment], | |||
sequences: SequenceDictionary) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile { | |||
sequences: SequenceDictionary, | |||
maybePartitionMapRdd: Option[RDD[(ReferenceRegion, ReferenceRegion)]] = None) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybePartitionMapRdd
→ partitionMap
or partitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I'd prefer partitionMap
over partitions
as partitions
sounds more harmonious with Spark's abstract idea of what a Partition is, while we are abusing that notion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If partitionMap
remains an RDD (which you've argued against above), then it shouldn't really be called Map. :)
@@ -145,7 +145,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord, | |||
* file was saved. | |||
*/ | |||
private[rdd] def maybeSaveBam(args: ADAMSaveAnyArgs, | |||
isSorted: Boolean = false): Boolean = { | |||
isSorted: Boolean = isSorted): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isSorted
→ sorted
, as elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we have sorted
? I've generally been preferring isSorted
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not writing Java Beans. if (sorted)
reads much better to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that sorted
is the prevailing usage, BTW:
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep isSorted {} \; | wc
23 158 1295
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep sorted {} \; | wc
48 463 3178
This disparity grows if you include test sources:
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep isSorted {} \; | wc
30 179 1459
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep sorted {} \; | wc
135 839 7236
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that sorted
is the prevailing usage, BTW:
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep isSorted {} \; | wc
23 158 1295
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep sorted {} \; | wc
48 463 3178
This disparity grows if you include test sources:
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep isSorted {} \; | wc
30 179 1459
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep sorted {} \; | wc
135 839 7236
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #1341 to fix this later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving a note here to say that we decided to use isSorted
because of the potential collision with sorted
in scala.collection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our isSorted
is not 100% the same as the SAM/BAM definition of coordinate sorted. We'd need to filter out any replicated records.
@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext], | |||
* Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts | |||
* and saves to disk as VCF. | |||
* | |||
* @param filePath The filepath to save to. | |||
* @param args The arguments for saving the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use full sentences for method parameter docs
ok to set the milestone for this as 0.22.0? |
+1 @heuermh, triaged to 0.22.0. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains e244fcfc34d1580be73e9c76e6dc72ceced60383 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision e244fcfc34d1580be73e9c76e6dc72ceced60383 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f e244fcfc34d1580be73e9c76e6dc72ceced60383First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 8f803afa8e862dff1626eb9002de0cb27175465a # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 8f803afa8e862dff1626eb9002de0cb27175465a (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 8f803afa8e862dff1626eb9002de0cb27175465aFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Cross-strand joins are handled by
Ah yes, things that are not on the same strand do not overlap. ;) This is intentional. IIRC, we clarified a lot of this in e98ee2d. I think the question here is less "what behavior is correct in ReferenceRegion.overlaps" and more "should stranded or unstranded reference regions be input into the region join"? Would you agree? CC @laserson who might have opinions as well. |
Seems like we should support both stranded and un-stranded. Could this simply be passed through to the operation that determines whether there is an overlap? |
I think there's a couple of ways we could do it. I think the simplest would be to generate unstranded reference regions as the keys to the join (after e98ee2d, this is v. easy). Perhaps we'd just pass this as a switch on the GenomicRDD methods? |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 8ccd18912301e91573c5245dc5c09edcf11fc337 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 8ccd18912301e91573c5245dc5c09edcf11fc337 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 8ccd18912301e91573c5245dc5c09edcf11fc337First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 2b68245 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 2b68245 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2b682452ac6d6453870022697b535357b023771dFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c9aac611a0d99b8f3e2e6266369d1fcf6467ef46First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b7204f839e6a69208114c630cf6cd025c13a6096 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision b7204f839e6a69208114c630cf6cd025c13a6096 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b7204f839e6a69208114c630cf6cd025c13a6096First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
|
||
val c = jRdd0.rdd.collect | ||
val c = jRdd.rdd.collect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seemed to be incorrect previously. Let me know if I should revert.
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ec88406f303d32bcdaefa004bfd2c31ad748d080 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ec88406f303d32bcdaefa004bfd2c31ad748d080 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ec88406f303d32bcdaefa004bfd2c31ad748d080First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 46af637fa97d2727a0077052d50d55270b18b707 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 46af637fa97d2727a0077052d50d55270b18b707 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 46af637fa97d2727a0077052d50d55270b18b707First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ab4cefe # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ab4cefe (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ab4cefeabf657914e554f7acd41e735144cfd5b2First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test PASSed. |
Thanks @devin-petersohn! I'm getting on a plane right now but should have time tomorrow AM to review. |
* we are only sorting a number of elements equal to the number of | ||
* partitions written. | ||
*/ | ||
private[rdd] class ADAMInputFormat[T] extends ParquetInputFormat[T] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename ADAMInputFormat
to ADAMParquetInputFormat
, à la ADAMBAMInputFormat
, ADAMVCFInputFormat
, etc?
Some(partitionMapBuilder.toArray) | ||
} catch { | ||
case e: FileNotFoundException => None | ||
// TODO: Log Exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO ;)
Should we log this or rethrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rethrowing is correct here.
* @param filename the filename for the metadata | ||
* @return a partition map if the data was written sorted, or an empty Seq if unsorted | ||
*/ | ||
private[rdd] def extractPartitionMap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should move this into an avro schema or something more structured, instead of parsing JSON. That said, let's not do it here and now. Can you open a ticket to refactor this into avroland post 0.23.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
* Load a path name in Parquet + Avro format into an AlignmentRecordRDD. | ||
* | ||
* @note The sequence dictionary is read from an Avro file stored at | ||
* pathName/_seqdict.avro and the record group dictionary is read from an | ||
* Avro file stored at pathName/_rgdict.avro. These files are pure Avro, | ||
* not Parquet + Avro. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep space.
key match { | ||
case (_, f2: Int) => f2 | ||
case _ => { | ||
throw new Exception("Unable to partition without destination assignment") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should include the offending key in the exception message. Otherwise, the exception is impossible to debug if thrown.
}).filter(f => f._1.nonEmpty).map(f => (f._1.get, f._2)) | ||
.sortBy(elem => elem._1, ascending = true, numPartitions = partitions) | ||
|
||
partitionedRDD.cache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were going to pull the partition mapping out of the sort code, so that people could sort without computing the partition map? Also, storage level should be parametrizable.
"Cannot copartition with an unsorted rdd!") | ||
|
||
val destinationPartitionMap = rddToCoPartitionWith.optPartitionMap.get | ||
//number of partitions we will have after repartition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits for the code in this function:
- there should be whitespace before a comment
- space after
//
// the zipWithIndex gives us the destination partition ID | ||
destinationPartitionMap.flatten.zipWithIndex.map(g => { | ||
val (firstRegion, secondRegion, index) = (g._1._1, g._1._2, g._2) | ||
// in the case where we span multiple referenceNames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a bit more documentation on this codepath? E.g., what is necessary to represent a partition, why is the first region sometimes enough, when isn't it, etc.
// includes any extremely long regions. we include the firstRegion for | ||
// the case that the first region is extremely long | ||
(iter ++ Iterator(firstRegion)).maxBy(f => (f._1.referenceName, f._1.end, f._1.start)) | ||
// only one record on this partition, so this is the extent of the bounds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this comment should be in the else
clause.
(lo to hi).map(i => ((region, i), y)) | ||
}) | ||
def compute(): RDD[(RT, RU)] = { | ||
leftRdd.zipPartitions(rightRdd, preservesPartitioning = true)(makeIterator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, this doesn't "preserve partitioning" as defined by Spark, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, I doubt that it matters. It just lets Spark know that the partitioner it used (in this case ManualRegionPartitioner
) to partition the data is still valid after this operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my point: isn't the ManualRegionPartitioner
invalid after the zipPartitions
, since the "key" now has type RT
and the partitioner expects type (_, Int)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I will remove it.
* | ||
* @param rdd The underlying AlignmentRecord RDD. | ||
* @return A new AlignmentRecordRDD. | ||
*/ | ||
def unaligned(rdd: RDD[AlignmentRecord]): AlignmentRecordRDD = { | ||
AlignmentRecordRDD(rdd, | ||
SequenceDictionary.empty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could call AlignmentRecordRDD(rdd, sequences, recordGroupDictionary, None)
directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even better, it could call the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I meant. Or is there a different syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See next 2 lines for solution
All the |
@@ -196,7 +196,7 @@ class VariantRDDSuite extends ADAMFunSuite { | |||
// we can't guarantee that we get exactly the number of partitions requested, | |||
// we get close though | |||
assert(jRdd.rdd.partitions.length === 1) | |||
assert(jRdd0.rdd.partitions.length === 5) | |||
assert(jRdd0.rdd.partitions.length === 4) | |||
|
|||
val c = jRdd0.rdd.collect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bug. I will fix it.
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boom!
Test PASSed. |
@devin-petersohn you've got a conflict in |
Addressing reviewer comments Performance improvements related to decreasing compute Major refactor to clean up ShuffleRegionJoin Fixing serialization issue Cleaning up tests and code More cleanup Addressing reviewer comments Addressing reviewer comments, cleaning up after intellij Testing that data stays sorted on Jenkins Still debugging the sorted persistance Testing that we can persist sort Checking output to hopefully see what is happening Testing more related to sort persist Adding ADAMInputFormat class to ensure ordering is maintained Adding docs and cleanup Cleaning up code a bit Cleaning up after IntelliJ Addressing reviewer comments Addressing reviewer comments Addressing reviewer comments Clean up docs, cutting at 80 characters. Addressing reviewer comments Addressing reviewer comments. sorted to isSorted Fixing some spacing issues Addressing reviwer comments
536b936
to
7a9503c
Compare
@fnothaft Rebased, thanks! |
Test PASSed. |
Merged! Thanks @devin-petersohn! |
Ready for review.