-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAM-1481] Refactor ADAMContext loadXxx methods for consistency #1487
Conversation
Test PASSed. |
Test FAILed. Build result: FAILURE[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1487/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b99480b8032090dce76fdfb1cce900bb7800a5ff # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1487/merge^{commit} # timeout=10Checking out Revision b99480b8032090dce76fdfb1cce900bb7800a5ff (origin/pr/1487/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b99480b8032090dce76fdfb1cce900bb7800a5ffFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test PASSed. |
Will do. I have more changes coming. |
Sounds good! Would you like me to hold off on reviewing until you're done with the changes? |
If you have feedback on some of the open questions now, that would be helpful. |
What are the open questions? Are they the following? If so, answers below:
All of the
I'm not sure about GZIPed SAM. |
Some clearly do not, Others forward to
Thanks, we need to make sure |
In general, they should, unless the underlying InputFormat is doing something funny.
+1 |
Dang it! Who made that work and forgot to update the |
Test FAILed. Build result: FAILURE[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse 579126a^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 579126a # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1487/head^{commit} # timeout=10Checking out Revision 579126a (origin/pr/1487/head) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 579126af17f2b882c715080dd75e309bfda3ebf5First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test PASSed. |
Test PASSed. |
This one is ready to review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check for compressors using the Hadoop factory, but otherwise this LGTM! Thanks @heuermh! This is a great cleanup.
* @param pathName The path name to load SAM/BAM/CRAM formatted alignments from. Globs/directories are | ||
* supported (todo: confirm). | ||
* @param pathName The path name to load SAM/BAM/CRAM formatted alignment records from. | ||
* Globs/directories are supported (todo: confirm). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can confirm; globs/directories are supported.
*/ | ||
private[adam] def isBedExt(pathName: String): Boolean = { | ||
pathName.endsWith(".bed") || | ||
pathName.endsWith(".bed.gz") || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside, for all of the compressed file format checks, we should use Hadoop's CompressionCodecFactory to check if the path is a compressed file. This allows users to swap in additional compression codecs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I suppose we could do an isCompressed check in ADAMContext
, since it has access to the Hadoop configuration, then trim off the extension before checking with FileExtensions
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM from a code perspective, but I think we should mention in each of the docstrings that we do support compressed text files, and where that comes from. I think we're very close here; thanks for driving this @heuermh!
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format. | ||
* * .fa/.fasta as FASTA format, | ||
* * .fq/.fastq as FASTQ format, and | ||
* * .ifq as interleaved FASTQ format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should mention that we read the configured compression codecs from the Hadoop config for this, which by default include .gz and .bz2, but can include more.
* If the path name has a .fa/.fa.gz/.fa.bz2/.fasta/.fasta.gz/.fasta.bz2 extension, | ||
* load as FASTA format. Else, fall back to Parquet + Avro. | ||
* If the path name has a .fa/.fasta extension, load as FASTA format. | ||
* Else, fall back to Parquet + Avro. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
@@ -92,7 +92,7 @@ class JavaADAMContext(val ac: ADAMContext) extends Serializable { | |||
* | |||
* Loads path names ending in: | |||
* * .bam/.cram/.sam as BAM/CRAM/SAM format and | |||
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format. | |||
* * .ifq as interleaved FASTQ format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
* * .gff3 as GFF3 format, | ||
* * .gtf/.gff as GTF/GFF2 format, | ||
* * .narrow[pP]eak as NarrowPeak format, and | ||
* * .interval_list as IntervalList format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
* * .gff3 as GFF3 format, | ||
* * .gtf/.gff as GTF/GFF2 format, | ||
* * .narrow[pP]eak as NarrowPeak format, and | ||
* * .interval_list as IntervalList format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
* * .gff3 as GFF3 format, | ||
* * .gtf/.gff as GTF/GFF2 format, | ||
* * .narrow[pP]eak as NarrowPeak format, and | ||
* * .interval_list as IntervalList format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
* * .gff3 as GFF3 format, | ||
* * .gtf/.gff as GTF/GFF2 format, | ||
* * .narrow[pP]eak as NarrowPeak format, and | ||
* * .interval_list as IntervalList format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format. | ||
* * .fa/.fasta as FASTA format, | ||
* * .fq/.fastq as FASTQ format, and | ||
* * .ifq as interleaved FASTQ format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
@@ -1721,7 +1744,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log | |||
* | |||
* Loads path names ending in: | |||
* * .bam/.cram/.sam as BAM/CRAM/SAM format and | |||
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format. | |||
* * .ifq as interleaved FASTQ format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Do you want to squash in any specific way, or would you like me to hit the green squash button?
Green squash is fine! |
Thank you! |
Fixes #1481.
I flagged most of the
pathName
method parameter param doc strings with(todo: confirm)
because it is not immediately clear which support glob/directories and which do not.There is also some inconsistency wrt when
.gz
is supported and when it is not.