Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1481] Refactor ADAMContext loadXxx methods for consistency #1487

Merged
merged 8 commits into from
Apr 19, 2017

Conversation

heuermh
Copy link
Member

@heuermh heuermh commented Apr 12, 2017

Fixes #1481.

I flagged most of the pathName method parameter param doc strings with (todo: confirm) because it is not immediately clear which support glob/directories and which do not.

There is also some inconsistency wrt when .gz is supported and when it is not.

@coveralls
Copy link

coveralls commented Apr 13, 2017

Coverage Status

Coverage increased (+0.03%) to 81.667% when pulling 072ccf5 on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1943/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1944/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1487/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b99480b8032090dce76fdfb1cce900bb7800a5ff # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1487/merge^{commit} # timeout=10Checking out Revision b99480b8032090dce76fdfb1cce900bb7800a5ff (origin/pr/1487/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b99480b8032090dce76fdfb1cce900bb7800a5ffFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 81.508% when pulling 8082b2c on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@coveralls
Copy link

coveralls commented Apr 13, 2017

Coverage Status

Coverage increased (+0.04%) to 81.674% when pulling 8082b2c on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1945/
Test PASSed.

@fnothaft
Copy link
Member

@heuermh After merging #1482, this branch has conflicts. Can you rebase?

@heuermh
Copy link
Member Author

heuermh commented Apr 13, 2017

Will do. I have more changes coming.

@fnothaft
Copy link
Member

Sounds good! Would you like me to hold off on reviewing until you're done with the changes?

@heuermh
Copy link
Member Author

heuermh commented Apr 13, 2017

If you have feedback on some of the open questions now, that would be helpful.

@fnothaft
Copy link
Member

What are the open questions? Are they the following? If so, answers below:

I flagged most of the pathName method parameter param doc strings with (todo: confirm) because it is not immediately clear which support glob/directories and which do not.

All of the loadX methods should support globs and directories.

There is also some inconsistency wrt when .gz is supported and when it is not.

  • GZIPed VCF is loaded (and even split if BGZIPed)
  • All of the files that use sc.textFile (FASTA, all of the feature formats) will load both GZIPed files (as a single split) and BZIP2 files
  • FileInputFormat (FASTQ) will load GZIPed data, but there is an issue, see Splittable compression and FASTQ #1457

I'm not sure about GZIPed SAM.

@heuermh
Copy link
Member Author

heuermh commented Apr 13, 2017

All of the loadX methods should support globs and directories.

Some clearly do not, loadIndexedBam, for example.

Others forward to loadAvro and sc.newAPIHadoopFile without going through getFsAndFiles and related, do those support globs and directories? E.g. loadInterleavedFastq forwards pathName: String to sc.newAPIHadoopFile.

There is also some inconsistency wrt when .gz is supported and when it is not.

Thanks, we need to make sure .gz and .bz2 extensions pass format guessing checks where appropriate then.

@fnothaft
Copy link
Member

All of the loadX methods should support globs and directories.

Some clearly do not, loadIndexedBam, for example.

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L369-L375 ;)

Others forward to loadAvro and sc.newAPIHadoopFile without going through getFsAndFiles and related, do those support globs and directories? E.g. loadInterleavedFastq forwards pathName: String to sc.newAPIHadoopFile.

In general, they should, unless the underlying InputFormat is doing something funny.

Thanks, we need to make sure .gz and .bz2 extensions pass format guessing checks where appropriate then.

+1

@heuermh
Copy link
Member Author

heuermh commented Apr 13, 2017

All of the loadX methods should support globs and directories.

Some clearly do not, loadIndexedBam, for example.

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L369-L375 ;)

Dang it! Who made that work and forgot to update the @param docs? :)

@fnothaft
Copy link
Member

Dang it! Who made that work and forgot to update the @param docs? :)

My bad! ;)

@coveralls
Copy link

coveralls commented Apr 13, 2017

Coverage Status

Coverage increased (+0.04%) to 81.674% when pulling 579126a on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1946/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse 579126a^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 579126a # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1487/head^{commit} # timeout=10Checking out Revision 579126a (origin/pr/1487/head) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 579126af17f2b882c715080dd75e309bfda3ebf5First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

coveralls commented Apr 14, 2017

Coverage Status

Coverage increased (+0.3%) to 81.854% when pulling c7814c9 on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1947/
Test PASSed.

@coveralls
Copy link

coveralls commented Apr 14, 2017

Coverage Status

Coverage increased (+0.2%) to 81.809% when pulling f3d1d3c on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1948/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Apr 14, 2017

This one is ready to review!

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check for compressors using the Hadoop factory, but otherwise this LGTM! Thanks @heuermh! This is a great cleanup.

* @param pathName The path name to load SAM/BAM/CRAM formatted alignments from. Globs/directories are
* supported (todo: confirm).
* @param pathName The path name to load SAM/BAM/CRAM formatted alignment records from.
* Globs/directories are supported (todo: confirm).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can confirm; globs/directories are supported.

*/
private[adam] def isBedExt(pathName: String): Boolean = {
pathName.endsWith(".bed") ||
pathName.endsWith(".bed.gz") ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, for all of the compressed file format checks, we should use Hadoop's CompressionCodecFactory to check if the path is a compressed file. This allows users to swap in additional compression codecs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I suppose we could do an isCompressed check in ADAMContext, since it has access to the Hadoop configuration, then trim off the extension before checking with FileExtensions

@coveralls
Copy link

coveralls commented Apr 18, 2017

Coverage Status

Coverage increased (+0.4%) to 81.942% when pulling b021e2c on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1954/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM from a code perspective, but I think we should mention in each of the docstrings that we do support compressed text files, and where that comes from. I think we're very close here; thanks for driving this @heuermh!

* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .fa/.fasta as FASTA format,
* * .fq/.fastq as FASTQ format, and
* * .ifq as interleaved FASTQ format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should mention that we read the configured compression codecs from the Hadoop config for this, which by default include .gz and .bz2, but can include more.

* If the path name has a .fa/.fa.gz/.fa.bz2/.fasta/.fasta.gz/.fasta.bz2 extension,
* load as FASTA format. Else, fall back to Parquet + Avro.
* If the path name has a .fa/.fasta extension, load as FASTA format.
* Else, fall back to Parquet + Avro.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@@ -92,7 +92,7 @@ class JavaADAMContext(val ac: ADAMContext) extends Serializable {
*
* Loads path names ending in:
* * .bam/.cram/.sam as BAM/CRAM/SAM format and
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .ifq as interleaved FASTQ format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .fa/.fasta as FASTA format,
* * .fq/.fastq as FASTQ format, and
* * .ifq as interleaved FASTQ format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@@ -1721,7 +1744,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
*
* Loads path names ending in:
* * .bam/.cram/.sam as BAM/CRAM/SAM format and
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .ifq as interleaved FASTQ format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@coveralls
Copy link

coveralls commented Apr 19, 2017

Coverage Status

Coverage increased (+0.4%) to 82.018% when pulling 9b13e6e on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1955/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Do you want to squash in any specific way, or would you like me to hit the green squash button?

@heuermh
Copy link
Member Author

heuermh commented Apr 19, 2017

Green squash is fine!

@fnothaft fnothaft merged commit 93d17b4 into bigdatagenomics:master Apr 19, 2017
@heuermh heuermh deleted the load-cleanup branch April 19, 2017 17:03
@heuermh
Copy link
Member Author

heuermh commented Apr 19, 2017

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants