Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolves various single file save/header issues #1104

Merged

Conversation

fnothaft
Copy link
Member

@fnothaft fnothaft commented Aug 8, 2016

Resolves #1009, #1059, #1058, #676.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1372/
Test PASSed.

@fnothaft
Copy link
Member Author

Resolves #1106 now too. Ready for review/merge.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1376/
Test PASSed.

val start = feature.getStart + 1 // IntervalList ranges are 1-based
val end = feature.getEnd // IntervalList ranges are closed
val strand = Features.asString(feature.getStrand, emptyUnknown = false)
val attributes = Features.gatherAttributes(feature)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes back to shoving all attributes into the IntervalList format name column, right? I saw that as an abuse to the format and generated a name with reasonable defaults, see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/Features.scala#L207.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% here—I'm a bit unfamiliar with the IntervalList format, even after this PR—but the implementation here passes round trip (with the exception of some ordering bits) with the interval list file in our test resources directory. For me to understand the difference a bit better, could you perhaps elucidate how this code would change the output? E.g., how would the resultant interval list line look different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc for IntervalList (all the spec I'm aware of) describes the last column as "Interval name (an, ideally unique, name for the interval)". I haven't seen many of these files in the wild, so I don't know if | delimited attributes in the Interval name column is a reasonable thing to do.

Thus I found the previous code somewhat unsavory and arbitrary (as some hard coded keys are shoved into dbxrefs), see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L199.

I hadn't got to round tripping; I guess I was planning to shove the Interval name column into Feature.name field when reading IntervalList format and use Features.nameOf when writing. And then find a different file for the unit test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to get back to this, the reason I've shoved records in here is that this is the format we got in the interval list file that described the exome panel. I don't know what the "best" approach is, but I don't think interval list is a rigorously specified format, so we're somewhat making things up as we go here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, without seeing more examples of these in the wild, I'm not sure if the name column should be mapped to Feature.name or to Feature.attributes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with me for now, will resolve in future issue

@heuermh
Copy link
Member

heuermh commented Aug 11, 2016

Looks pretty good. Will try it out on various files tomorrow (and also hopefully re-remember how to use our internal cluster).

@jpdna
Copy link
Member

jpdna commented Aug 19, 2016

I attempted a round-trip BAM file conversion like:

adam-submit transform testb.bam testb.adam
adam-submit transform testb.adam testb.roundtrip.bam -single

and although the resulting testb.roundtrip.bam is 104MB ( the original was 101MB) running samtools flagstat testb.roundtrip.bam reports 0 counts for all stats, while there were 984817 reads reported by flagstat on the orignal file.

Also samtools view -h -o testb.roundtrip.conver2sam.sam testb.roundtrip.bam will not produce any reads data in an output sam file, only the header. So although no corruption error is logged by samtools, the file must be corrupt from the point of view of samtools 1.3

When not using -single
adam-submit transform testb.adam testb.roundtrip.bam part-r-0000* parquet files appear to be correctly generated.

If you have trouble replicating - let me know and I'll get you the file.

@fnothaft
Copy link
Member Author

Thanks for looking at this @jpdna! By any chance, have you tried running any of the Picard metrics tools on the file? Our round trip SAM/BAM unit tests are passing, so I believe that htsjdk is able to read the file fine. I wonder if we're writing the BAM header in a way so that htsjdk and htslib read the file differently? I'll take a look tomorrow.

@jpdna
Copy link
Member

jpdna commented Aug 19, 2016

Interestingly - Picard seems to not like even the ORIGINAL test input file

java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=testb.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq
[Fri Aug 19 00:02:08 EDT 2016] picard.sam.SamToFastq INPUT=testb.bam FASTQ=out.fastq SECOND_END_FASTQ=out_second_end.fastq    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Aug 19 00:02:08 EDT 2016] Executing as jp@jp1 on Linux 4.3.5-300.fc23.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.2.2(20d49152d0840a960fcb97df76dbaca260b39244_1461168806) IntelDeflater
[Fri Aug 19 00:02:09 EDT 2016] picard.sam.SamToFastq done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2024275968
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 40060, Read name ERR001301.6170436, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:652)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:637)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:607)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at picard.sam.SamToFastq.doWork(SamToFastq.java:174)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Here is the original bam file I am feeding into this round-trip test:
https://drive.google.com/file/d/0B6jh69UgixwpVVJwdE1qTXhjcEk/view?usp=sharing

I'd write this off to just an incorrect BAM file that samtools is too permissive on, but the test file was derived from the 1000 genomes files available here:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam

I also just tried that whole 1000 genomes with the Picard command above, and Picard produces the same error, but samtools flagstat and view seems to work fine on it.

So - it would be interesting to know exactly what is up with these 1000 genomes files that makes htsjdk unable to read them ( related to "bin" field I guess) - and thus ADAM can't either, and see if this jdk error message can at least be passed on more clearly in ADAM rather than just dropping the reads - if indeed it is legitimate to refuse to process these.

@jpdna
Copy link
Member

jpdna commented Aug 19, 2016

I googled a bit and find that setting VALIDATION_STRINGENCY=LENIENT when running picard allows picard to process this 1000 genomes file file.

 java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq VALIDATION_STRINGENCY=LENIENT

as discussed here : http://gatkforums.broadinstitute.org/gatk/discussion/4290/sam-bin-field-error-for-the-gatk-run

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

@fnothaft
Copy link
Member Author

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

I don't think that applies for ADAM, as we're not using the index.

@jpdna
Copy link
Member

jpdna commented Aug 19, 2016

The ADAM parquet file testb.adam examined with adam-submit flagstat testb.adam does appear fine and agrees withsamtools flagstat of the original test.bam.

Also looking further at the per partition "bam" files named part-r-0000* produced by adam-submit transform testb.adam testb.roundtrip.NOSINGLE.bam without -single are reporting reasonable samtools flagstat count of reads, though the
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
warning is logged by samtools.

@jpdna
Copy link
Member

jpdna commented Aug 19, 2016

Trying to output a -single sam file (rather than bam) works fine as:
adam-submit transform testb.adam testb.rountrip.sam -single
and samtools is then happy to operate on that sam file, including converting it to bam.

Also - I did same test and got same result of samtools not being to read the round tripped bam when using the unsorted.sam file we use in our unit tests.

adam-submit transform unsorted.sam unsorted.adam
../adam/bin/adam-submit transform unsorted.adam unsorted.roundtrip.bam -single


samtools flagstat unsorted.roundtrip.bam
0 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (N/A : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

@fnothaft
Copy link
Member Author

<3 htslib/htsjdk format disagreements

@jpdna thanks for testing this on samtools. I'll go ahead and debug this further. As an aside, we should add an integration test that reads the files in from samtools, or some other tool that uses htslib. I know that htslib and htsjdk also disagree on VCF, but this one is somewhat shocking. If we can root cause the issue, we should report it upstream.

@jpdna
Copy link
Member

jpdna commented Aug 19, 2016

I'd be happy to have a go at making such integration test with samtools, I'll open an issue. We can then test the fix with it when ready.

@fnothaft
Copy link
Member Author

Awesome! Thanks @jpdna!

@fnothaft
Copy link
Member Author

I'm looking at this now. It looks like we write the header completely correctly. I'm not sure what exactly is going on, so I'm unrolling some of the code so that we can write uncompressed BAM (at least for the header) so that I can look at a hexdump to see what's going wrong.

Otherwise, we've got some bad stuff going on in the AlignmentRecordRDD round trip tests, so I'm going to patch this up in this PR as well.

@jpdna
Copy link
Member

jpdna commented Aug 30, 2016

Just pinging on this @fnothaft - I think you are continuing to dig through the hexdumps?

@fnothaft
Copy link
Member Author

Thanks for the ping @jpdna ; let me get back on this...

@fnothaft fnothaft force-pushed the issues/1009-single-file-save branch from 17cea3a to 9bd6785 Compare August 30, 2016 22:33
@fnothaft
Copy link
Member Author

fnothaft commented Aug 30, 2016

Just rebased and pushed a fix for the samtools issue. It is isolated in 74928e6, I'll squash that down into one of the other commits later. I've tested this locally on a small file with samtools. @jpdna can you check this on your side as well?

@fnothaft fnothaft force-pushed the issues/1009-single-file-save branch from 9bd6785 to 74928e6 Compare August 30, 2016 22:34
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1438/
Test PASSed.

@fnothaft
Copy link
Member Author

Ping for review/merge?

@heuermh
Copy link
Member

heuermh commented Aug 31, 2016

I'm not sure we resolved the IntervalList format name column discussion above, and a few of the other review comments still need addressing.

On my end, I want to run through some tests on the cluster.

@fnothaft
Copy link
Member Author

Ah, sorry! I did a quick pass over the PR earlier and thought I'd addressed the comments. Evidently, I read way too fast. I'll make a pass over those tomorrow.

@fnothaft
Copy link
Member Author

fnothaft commented Sep 6, 2016

@heuermh just rebased and addressed your review comments. Let me know your thoughts.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1460/
Test PASSed.

log.warn("Empty or invalid NarrowPeak line: {}", line)
if (stringency == ValidationStringency.STRICT) {
throw new IllegalArgumentException("Empty or invalid NarrowPeak line: %s".format(line))
} else if (stringency == ValidationStringency.STRICT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be LENIENT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just patched, thanks for catching.

@fnothaft fnothaft force-pushed the issues/1009-single-file-save branch from 740e520 to f21ead5 Compare September 6, 2016 22:37
@fnothaft
Copy link
Member Author

fnothaft commented Sep 6, 2016

@heuermh if this looks good, can you merge? Merging this will simplify #1145.

@@ -169,18 +179,28 @@ class GFF3Parser extends FeatureParser {
* In lieu of a specification, see:
* https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/util/IntervalList.html
*/
class IntervalListParser extends Serializable {
def parse(line: String): (Option[SequenceRecord], Option[Feature]) = {
private[rdd] class IntervalListParser extends Serializable with Logging {
Copy link
Member

@heuermh heuermh Sep 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are reusing the SAM header code for IntervalList format I was thinking we could remove all the startsWith("@") stuff here.

This method should only need to handle the five tab-delimited columns (Sequence name, Start position (1-based), End position (1-based, end inclusive), Strand (either + or -), Interval name (an, ideally unique, name for the interval).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're reusing the SAM header writing code, not the SAM header reading code. We might be able to reuse that code, but that's less straightforward. Perhaps we can punt that to a later ticket?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, I just don't have the bandwidth to take that on now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, how about just getting rid of the arbitrary dbxref key stuff at

https://github.com/fnothaft/adam/blob/f21ead579471ab7854aa519ee7efc884f2f6f38b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L226

then? Don't set dbxrefs at all, just attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was meant to be on a different line? That change tracks all the way back to #686. I'd want to get @ryan-williams' thoughts on that code before we yanked it out or changed it. I think we're getting into a debate that's somewhat orthogonal to this PR, which is what exactly the .interval_list format is. I've just added code to said parser so that it passes round trip tests. If we'd like to change the parser, I think we should open an issue and discuss that there, and not block this PR on it. That's just my opinion though. Perhaps let's discuss on the call tomorrow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create an issue to 1) look into using SAM header code when reading IntervalList format and 2) clean up this dbxref key stuff

@heuermh
Copy link
Member

heuermh commented Sep 6, 2016

Reviewing meow

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1461/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Sep 7, 2016

Created a new issue #1152 to resolve IntervalList discussions later.

Is this pull request ready to merge then?

@fnothaft fnothaft force-pushed the issues/1009-single-file-save branch from f21ead5 to 79749ae Compare September 7, 2016 18:02
@fnothaft
Copy link
Member Author

fnothaft commented Sep 7, 2016

Just rebased. Should be good for a merge once tests pass.

Factor single file save code out of AlignmentRecordRDD. Resolves bigdatagenomics#1009.
…List.

Resolves bigdatagenomics#1059. Factored out SAM Header writing code, used it to implement
single file save for IntervalList files.
Resolves bigdatagenomics#1058. Added `asSingleFile` to all legacy Feature format save methods
in `org.bdgenomics.adam.rdd.features.FeatureRDD`. Renamed `Features2ADAM` to
`TransformFeatures` and added the `-single` option, as well as the ability to
export back to legacy feature formats.
Resolves bigdatagenomics#676. In bigdatagenomics#964, we resolved the "header not set" issues for single file
SAM/BAM output. This change propegates this fix to sharded SAM/BAM output, and
VCF.
Resolves bigdatagenomics#1106. Improves round trip coverage by checking feature counts after
writing out, and spot checking converted feature lines.

* Split out all string encoding functions for features to make them easier to
  test, and wrote tests to spot check the correctness of these features.
* Made end-to-end feature tests pass.
* Eliminated some redundant tests.
* Added assertions to check that the same number of features were present in the
  input and output files when writing between different formats.
* Rewrote interval list parser and encoder.
* Fixed various bugs:
  * GFF2/GTF/GFF3 parser would discard lines that lacked an attribute column.
  * NarrowPeak parser would discard INDEPENDENT strand.
  * NarrowPeak signal intensities should be written as Ints, not Floats.
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1466/
Test PASSed.

@heuermh heuermh merged commit 79749ae into bigdatagenomics:master Sep 7, 2016
@heuermh
Copy link
Member

heuermh commented Sep 7, 2016

Merged via multiple commits.

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants