Resolves various single file save/header issues #1104

fnothaft · 2016-08-08T20:29:23Z

AmplabJenkins · 2016-08-08T20:55:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1372/
Test PASSed.

fnothaft · 2016-08-10T15:51:00Z

Resolves #1106 now too. Ready for review/merge.

AmplabJenkins · 2016-08-10T16:16:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1376/
Test PASSed.

heuermh · 2016-08-11T04:21:12Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureRDD.scala

+    val start = feature.getStart + 1 // IntervalList ranges are 1-based
+    val end = feature.getEnd // IntervalList ranges are closed
+    val strand = Features.asString(feature.getStrand, emptyUnknown = false)
+    val attributes = Features.gatherAttributes(feature)


This goes back to shoving all attributes into the IntervalList format name column, right? I saw that as an abuse to the format and generated a name with reasonable defaults, see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/Features.scala#L207.

I'm not 100% here—I'm a bit unfamiliar with the IntervalList format, even after this PR—but the implementation here passes round trip (with the exception of some ordering bits) with the interval list file in our test resources directory. For me to understand the difference a bit better, could you perhaps elucidate how this code would change the output? E.g., how would the resultant interval list line look different?

The javadoc for IntervalList (all the spec I'm aware of) describes the last column as "Interval name (an, ideally unique, name for the interval)". I haven't seen many of these files in the wild, so I don't know if | delimited attributes in the Interval name column is a reasonable thing to do.

Thus I found the previous code somewhat unsavory and arbitrary (as some hard coded keys are shoved into dbxrefs), see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L199.

I hadn't got to round tripping; I guess I was planning to shove the Interval name column into Feature.name field when reading IntervalList format and use Features.nameOf when writing. And then find a different file for the unit test.

Just wanted to get back to this, the reason I've shoved records in here is that this is the format we got in the interval list file that described the exome panel. I don't know what the "best" approach is, but I don't think interval list is a rigorously specified format, so we're somewhat making things up as we go here.

Yeah, without seeing more examples of these in the wild, I'm not sure if the name column should be mapped to Feature.name or to Feature.attributes.

ok with me for now, will resolve in future issue

heuermh · 2016-08-11T05:00:08Z

Looks pretty good. Will try it out on various files tomorrow (and also hopefully re-remember how to use our internal cluster).

jpdna · 2016-08-19T02:25:00Z

I attempted a round-trip BAM file conversion like:

adam-submit transform testb.bam testb.adam
adam-submit transform testb.adam testb.roundtrip.bam -single

and although the resulting testb.roundtrip.bam is 104MB ( the original was 101MB) running samtools flagstat testb.roundtrip.bam reports 0 counts for all stats, while there were 984817 reads reported by flagstat on the orignal file.

Also samtools view -h -o testb.roundtrip.conver2sam.sam testb.roundtrip.bam will not produce any reads data in an output sam file, only the header. So although no corruption error is logged by samtools, the file must be corrupt from the point of view of samtools 1.3

When not using -single
adam-submit transform testb.adam testb.roundtrip.bam part-r-0000* parquet files appear to be correctly generated.

If you have trouble replicating - let me know and I'll get you the file.

fnothaft · 2016-08-19T03:48:23Z

Thanks for looking at this @jpdna! By any chance, have you tried running any of the Picard metrics tools on the file? Our round trip SAM/BAM unit tests are passing, so I believe that htsjdk is able to read the file fine. I wonder if we're writing the BAM header in a way so that htsjdk and htslib read the file differently? I'll take a look tomorrow.

jpdna · 2016-08-19T04:34:03Z

Interestingly - Picard seems to not like even the ORIGINAL test input file

java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=testb.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq
[Fri Aug 19 00:02:08 EDT 2016] picard.sam.SamToFastq INPUT=testb.bam FASTQ=out.fastq SECOND_END_FASTQ=out_second_end.fastq    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Aug 19 00:02:08 EDT 2016] Executing as jp@jp1 on Linux 4.3.5-300.fc23.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.2.2(20d49152d0840a960fcb97df76dbaca260b39244_1461168806) IntelDeflater
[Fri Aug 19 00:02:09 EDT 2016] picard.sam.SamToFastq done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2024275968
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 40060, Read name ERR001301.6170436, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:652)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:637)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:607)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at picard.sam.SamToFastq.doWork(SamToFastq.java:174)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Here is the original bam file I am feeding into this round-trip test:
https://drive.google.com/file/d/0B6jh69UgixwpVVJwdE1qTXhjcEk/view?usp=sharing

I'd write this off to just an incorrect BAM file that samtools is too permissive on, but the test file was derived from the 1000 genomes files available here:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam

I also just tried that whole 1000 genomes with the Picard command above, and Picard produces the same error, but samtools flagstat and view seems to work fine on it.

So - it would be interesting to know exactly what is up with these 1000 genomes files that makes htsjdk unable to read them ( related to "bin" field I guess) - and thus ADAM can't either, and see if this jdk error message can at least be passed on more clearly in ADAM rather than just dropping the reads - if indeed it is legitimate to refuse to process these.

jpdna · 2016-08-19T14:10:20Z

I googled a bit and find that setting VALIDATION_STRINGENCY=LENIENT when running picard allows picard to process this 1000 genomes file file.

 java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq VALIDATION_STRINGENCY=LENIENT

as discussed here : http://gatkforums.broadinstitute.org/gatk/discussion/4290/sam-bin-field-error-for-the-gatk-run

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

fnothaft · 2016-08-19T14:20:10Z

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

I don't think that applies for ADAM, as we're not using the index.

jpdna · 2016-08-19T15:40:25Z

The ADAM parquet file testb.adam examined with adam-submit flagstat testb.adam does appear fine and agrees withsamtools flagstat of the original test.bam.

Also looking further at the per partition "bam" files named part-r-0000* produced by adam-submit transform testb.adam testb.roundtrip.NOSINGLE.bam without -single are reporting reasonable samtools flagstat count of reads, though the
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
warning is logged by samtools.

jpdna · 2016-08-19T17:11:15Z

Trying to output a -single sam file (rather than bam) works fine as:
adam-submit transform testb.adam testb.rountrip.sam -single
and samtools is then happy to operate on that sam file, including converting it to bam.

Also - I did same test and got same result of samtools not being to read the round tripped bam when using the unsorted.sam file we use in our unit tests.

adam-submit transform unsorted.sam unsorted.adam
../adam/bin/adam-submit transform unsorted.adam unsorted.roundtrip.bam -single


samtools flagstat unsorted.roundtrip.bam
0 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (N/A : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

fnothaft · 2016-08-19T17:47:07Z

<3 htslib/htsjdk format disagreements

@jpdna thanks for testing this on samtools. I'll go ahead and debug this further. As an aside, we should add an integration test that reads the files in from samtools, or some other tool that uses htslib. I know that htslib and htsjdk also disagree on VCF, but this one is somewhat shocking. If we can root cause the issue, we should report it upstream.

jpdna · 2016-08-19T17:52:02Z

I'd be happy to have a go at making such integration test with samtools, I'll open an issue. We can then test the fix with it when ready.

fnothaft · 2016-08-19T17:52:40Z

Awesome! Thanks @jpdna!

fnothaft · 2016-08-23T04:20:52Z

I'm looking at this now. It looks like we write the header completely correctly. I'm not sure what exactly is going on, so I'm unrolling some of the code so that we can write uncompressed BAM (at least for the header) so that I can look at a hexdump to see what's going wrong.

Otherwise, we've got some bad stuff going on in the AlignmentRecordRDD round trip tests, so I'm going to patch this up in this PR as well.

jpdna · 2016-08-30T18:25:36Z

Just pinging on this @fnothaft - I think you are continuing to dig through the hexdumps?

fnothaft · 2016-08-30T22:13:48Z

Thanks for the ping @jpdna ; let me get back on this...

fnothaft · 2016-08-30T22:34:29Z

Just rebased and pushed a fix for the samtools issue. It is isolated in 74928e6, I'll squash that down into one of the other commits later. I've tested this locally on a small file with samtools. @jpdna can you check this on your side as well?

AmplabJenkins · 2016-08-30T22:57:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1438/
Test PASSed.

fnothaft · 2016-08-31T23:04:06Z

Ping for review/merge?

heuermh · 2016-08-31T23:14:52Z

I'm not sure we resolved the IntervalList format name column discussion above, and a few of the other review comments still need addressing.

On my end, I want to run through some tests on the cluster.

fnothaft · 2016-08-31T23:23:42Z

Ah, sorry! I did a quick pass over the PR earlier and thought I'd addressed the comments. Evidently, I read way too fast. I'll make a pass over those tomorrow.

fnothaft · 2016-09-06T22:06:14Z

@heuermh just rebased and addressed your review comments. Let me know your thoughts.

AmplabJenkins · 2016-09-06T22:33:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1460/
Test PASSed.

heuermh · 2016-09-06T22:36:14Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala

-      log.warn("Empty or invalid NarrowPeak line: {}", line)
+      if (stringency == ValidationStringency.STRICT) {
+        throw new IllegalArgumentException("Empty or invalid NarrowPeak line: %s".format(line))
+      } else if (stringency == ValidationStringency.STRICT) {


this should be LENIENT

Just patched, thanks for catching.

fnothaft · 2016-09-06T22:45:33Z

@heuermh if this looks good, can you merge? Merging this will simplify #1145.

heuermh · 2016-09-06T22:51:03Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala

@@ -169,18 +179,28 @@ class GFF3Parser extends FeatureParser {
 * In lieu of a specification, see:
 * https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/util/IntervalList.html
 */
-class IntervalListParser extends Serializable {
-  def parse(line: String): (Option[SequenceRecord], Option[Feature]) = {
+private[rdd] class IntervalListParser extends Serializable with Logging {


Since we are reusing the SAM header code for IntervalList format I was thinking we could remove all the startsWith("@") stuff here.

This method should only need to handle the five tab-delimited columns (Sequence name, Start position (1-based), End position (1-based, end inclusive), Strand (either + or -), Interval name (an, ideally unique, name for the interval).

We're reusing the SAM header writing code, not the SAM header reading code. We might be able to reuse that code, but that's less straightforward. Perhaps we can punt that to a later ticket?

Essentially, I just don't have the bandwidth to take that on now.

ok, how about just getting rid of the arbitrary dbxref key stuff at

https://github.com/fnothaft/adam/blob/f21ead579471ab7854aa519ee7efc884f2f6f38b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L226

then? Don't set dbxrefs at all, just attributes.

I assume this was meant to be on a different line? That change tracks all the way back to #686. I'd want to get @ryan-williams' thoughts on that code before we yanked it out or changed it. I think we're getting into a debate that's somewhat orthogonal to this PR, which is what exactly the .interval_list format is. I've just added code to said parser so that it passes round trip tests. If we'd like to change the parser, I think we should open an issue and discuss that there, and not block this PR on it. That's just my opinion though. Perhaps let's discuss on the call tomorrow?

Let's create an issue to 1) look into using SAM header code when reading IntervalList format and 2) clean up this dbxref key stuff

heuermh · 2016-09-06T22:53:05Z

Reviewing meow

AmplabJenkins · 2016-09-06T22:57:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1461/
Test PASSed.

heuermh · 2016-09-07T17:34:30Z

Created a new issue #1152 to resolve IntervalList discussions later.

Is this pull request ready to merge then?

fnothaft · 2016-09-07T18:03:26Z

Just rebased. Should be good for a merge once tests pass.

Factor single file save code out of AlignmentRecordRDD. Resolves bigdatagenomics#1009.

…List. Resolves bigdatagenomics#1059. Factored out SAM Header writing code, used it to implement single file save for IntervalList files.

Resolves bigdatagenomics#1058. Added `asSingleFile` to all legacy Feature format save methods in `org.bdgenomics.adam.rdd.features.FeatureRDD`. Renamed `Features2ADAM` to `TransformFeatures` and added the `-single` option, as well as the ability to export back to legacy feature formats.

Resolves bigdatagenomics#676. In bigdatagenomics#964, we resolved the "header not set" issues for single file SAM/BAM output. This change propegates this fix to sharded SAM/BAM output, and VCF.

Resolves bigdatagenomics#1106. Improves round trip coverage by checking feature counts after writing out, and spot checking converted feature lines. * Split out all string encoding functions for features to make them easier to test, and wrote tests to spot check the correctness of these features. * Made end-to-end feature tests pass. * Eliminated some redundant tests. * Added assertions to check that the same number of features were present in the input and output files when writing between different formats. * Rewrote interval list parser and encoder. * Fixed various bugs: * GFF2/GTF/GFF3 parser would discard lines that lacked an attribute column. * NarrowPeak parser would discard INDEPENDENT strand. * NarrowPeak signal intensities should be written as Ints, not Floats.

AmplabJenkins · 2016-09-07T18:38:33Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1466/
Test PASSed.

heuermh · 2016-09-07T18:43:41Z

Merged via multiple commits.

Thank you, @fnothaft!

fnothaft mentioned this pull request Aug 9, 2016

Improve coverage in Feature unit tests #1106

Closed

heuermh reviewed Aug 11, 2016
View reviewed changes

jpdna mentioned this pull request Aug 19, 2016

Samtools (htslib) integration testing #1120

Closed

fnothaft force-pushed the issues/1009-single-file-save branch from 17cea3a to 9bd6785 Compare August 30, 2016 22:33

fnothaft force-pushed the issues/1009-single-file-save branch from 9bd6785 to 74928e6 Compare August 30, 2016 22:34

fnothaft mentioned this pull request Sep 1, 2016

[ADAM-1141] Add support for saving/loading AlignmentRecords to/from CRAM. #1145

Merged

fnothaft force-pushed the issues/1009-single-file-save branch from 74928e6 to 740e520 Compare September 6, 2016 22:05

heuermh reviewed Sep 6, 2016
View reviewed changes

fnothaft force-pushed the issues/1009-single-file-save branch from 740e520 to f21ead5 Compare September 6, 2016 22:37

heuermh reviewed Sep 6, 2016
View reviewed changes

fnothaft force-pushed the issues/1009-single-file-save branch from f21ead5 to 79749ae Compare September 7, 2016 18:02

fnothaft added 7 commits September 7, 2016 11:06

[ADAM-1009] Port single file save code over to VCF.

d5221a5

Factor single file save code out of AlignmentRecordRDD. Resolves bigdatagenomics#1009.

[ADAM-1059] Implemented single file save with SAM header for Interval…

06180e0

…List. Resolves bigdatagenomics#1059. Factored out SAM Header writing code, used it to implement single file save for IntervalList files.

[ADAM-676] Clean up header issues for sharded files.

48caa3f

Resolves bigdatagenomics#676. In bigdatagenomics#964, we resolved the "header not set" issues for single file SAM/BAM output. This change propegates this fix to sharded SAM/BAM output, and VCF.

Fixed samtools reading issue.

3b99446

Addressing review comments.

79749ae

heuermh merged commit 79749ae into bigdatagenomics:master Sep 7, 2016

Resolves various single file save/header issues #1104

Resolves various single file save/header issues #1104

Conversation

fnothaft commented Aug 8, 2016

AmplabJenkins commented Aug 8, 2016

fnothaft commented Aug 10, 2016

AmplabJenkins commented Aug 10, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

heuermh commented Aug 11, 2016

jpdna commented Aug 19, 2016 • edited Loading

fnothaft commented Aug 19, 2016

jpdna commented Aug 19, 2016

jpdna commented Aug 19, 2016

fnothaft commented Aug 19, 2016

jpdna commented Aug 19, 2016

jpdna commented Aug 19, 2016 • edited Loading

fnothaft commented Aug 19, 2016

jpdna commented Aug 19, 2016

fnothaft commented Aug 19, 2016

fnothaft commented Aug 23, 2016

jpdna commented Aug 30, 2016

fnothaft commented Aug 30, 2016

fnothaft commented Aug 30, 2016 • edited Loading

AmplabJenkins commented Aug 30, 2016

fnothaft commented Aug 31, 2016

heuermh commented Aug 31, 2016

fnothaft commented Aug 31, 2016

fnothaft commented Sep 6, 2016

AmplabJenkins commented Sep 6, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fnothaft commented Sep 6, 2016

heuermh Sep 6, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

heuermh commented Sep 6, 2016

AmplabJenkins commented Sep 6, 2016

heuermh commented Sep 7, 2016

fnothaft commented Sep 7, 2016

AmplabJenkins commented Sep 7, 2016

heuermh commented Sep 7, 2016

jpdna commented Aug 19, 2016 •

edited

Loading

jpdna commented Aug 19, 2016 •

edited

Loading

fnothaft commented Aug 30, 2016 •

edited

Loading

heuermh Sep 6, 2016 •

edited

Loading