-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolves various single file save/header issues #1104
Resolves various single file save/header issues #1104
Conversation
Test PASSed. |
Resolves #1106 now too. Ready for review/merge. |
Test PASSed. |
val start = feature.getStart + 1 // IntervalList ranges are 1-based | ||
val end = feature.getEnd // IntervalList ranges are closed | ||
val strand = Features.asString(feature.getStrand, emptyUnknown = false) | ||
val attributes = Features.gatherAttributes(feature) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goes back to shoving all attributes into the IntervalList format name column, right? I saw that as an abuse to the format and generated a name with reasonable defaults, see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/Features.scala#L207.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% here—I'm a bit unfamiliar with the IntervalList format, even after this PR—but the implementation here passes round trip (with the exception of some ordering bits) with the interval list file in our test resources directory. For me to understand the difference a bit better, could you perhaps elucidate how this code would change the output? E.g., how would the resultant interval list line look different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The javadoc for IntervalList (all the spec I'm aware of) describes the last column as "Interval name (an, ideally unique, name for the interval)". I haven't seen many of these files in the wild, so I don't know if |
delimited attributes in the Interval name column is a reasonable thing to do.
Thus I found the previous code somewhat unsavory and arbitrary (as some hard coded keys are shoved into dbxrefs), see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L199.
I hadn't got to round tripping; I guess I was planning to shove the Interval name column into Feature.name
field when reading IntervalList format and use Features.nameOf
when writing. And then find a different file for the unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to get back to this, the reason I've shoved records in here is that this is the format we got in the interval list file that described the exome panel. I don't know what the "best" approach is, but I don't think interval list is a rigorously specified format, so we're somewhat making things up as we go here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, without seeing more examples of these in the wild, I'm not sure if the name column should be mapped to Feature.name
or to Feature.attributes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok with me for now, will resolve in future issue
Looks pretty good. Will try it out on various files tomorrow (and also hopefully re-remember how to use our internal cluster). |
I attempted a round-trip BAM file conversion like:
and although the resulting Also When not using -single If you have trouble replicating - let me know and I'll get you the file. |
Thanks for looking at this @jpdna! By any chance, have you tried running any of the Picard metrics tools on the file? Our round trip SAM/BAM unit tests are passing, so I believe that htsjdk is able to read the file fine. I wonder if we're writing the BAM header in a way so that htsjdk and htslib read the file differently? I'll take a look tomorrow. |
Interestingly - Picard seems to not like even the ORIGINAL test input file
Here is the original bam file I am feeding into this round-trip test: I'd write this off to just an incorrect BAM file that samtools is too permissive on, but the test file was derived from the 1000 genomes files available here: I also just tried that whole 1000 genomes with the Picard command above, and Picard produces the same error, but samtools flagstat and view seems to work fine on it. So - it would be interesting to know exactly what is up with these 1000 genomes files that makes htsjdk unable to read them ( related to "bin" field I guess) - and thus ADAM can't either, and see if this jdk error message can at least be passed on more clearly in ADAM rather than just dropping the reads - if indeed it is legitimate to refuse to process these. |
I googled a bit and find that setting
as discussed here : http://gatkforums.broadinstitute.org/gatk/discussion/4290/sam-bin-field-error-for-the-gatk-run So perhaps we need to see if we can allow the same "lenient" mode in ADAM? |
I don't think that applies for ADAM, as we're not using the index. |
The ADAM parquet file Also looking further at the per partition "bam" files named part-r-0000* produced by |
Trying to output a Also - I did same test and got same result of samtools not being to read the round tripped bam when using the
|
<3 htslib/htsjdk format disagreements @jpdna thanks for testing this on samtools. I'll go ahead and debug this further. As an aside, we should add an integration test that reads the files in from samtools, or some other tool that uses htslib. I know that htslib and htsjdk also disagree on VCF, but this one is somewhat shocking. If we can root cause the issue, we should report it upstream. |
I'd be happy to have a go at making such integration test with samtools, I'll open an issue. We can then test the fix with it when ready. |
Awesome! Thanks @jpdna! |
I'm looking at this now. It looks like we write the header completely correctly. I'm not sure what exactly is going on, so I'm unrolling some of the code so that we can write uncompressed BAM (at least for the header) so that I can look at a hexdump to see what's going wrong. Otherwise, we've got some bad stuff going on in the |
Just pinging on this @fnothaft - I think you are continuing to dig through the hexdumps? |
Thanks for the ping @jpdna ; let me get back on this... |
17cea3a
to
9bd6785
Compare
9bd6785
to
74928e6
Compare
Test PASSed. |
Ping for review/merge? |
I'm not sure we resolved the IntervalList format name column discussion above, and a few of the other review comments still need addressing. On my end, I want to run through some tests on the cluster. |
Ah, sorry! I did a quick pass over the PR earlier and thought I'd addressed the comments. Evidently, I read way too fast. I'll make a pass over those tomorrow. |
74928e6
to
740e520
Compare
@heuermh just rebased and addressed your review comments. Let me know your thoughts. |
Test PASSed. |
log.warn("Empty or invalid NarrowPeak line: {}", line) | ||
if (stringency == ValidationStringency.STRICT) { | ||
throw new IllegalArgumentException("Empty or invalid NarrowPeak line: %s".format(line)) | ||
} else if (stringency == ValidationStringency.STRICT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be LENIENT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just patched, thanks for catching.
740e520
to
f21ead5
Compare
@@ -169,18 +179,28 @@ class GFF3Parser extends FeatureParser { | |||
* In lieu of a specification, see: | |||
* https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/util/IntervalList.html | |||
*/ | |||
class IntervalListParser extends Serializable { | |||
def parse(line: String): (Option[SequenceRecord], Option[Feature]) = { | |||
private[rdd] class IntervalListParser extends Serializable with Logging { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are reusing the SAM header code for IntervalList format I was thinking we could remove all the startsWith("@")
stuff here.
This method should only need to handle the five tab-delimited columns (Sequence name, Start position (1-based), End position (1-based, end inclusive), Strand (either + or -), Interval name (an, ideally unique, name for the interval).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're reusing the SAM header writing code, not the SAM header reading code. We might be able to reuse that code, but that's less straightforward. Perhaps we can punt that to a later ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially, I just don't have the bandwidth to take that on now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, how about just getting rid of the arbitrary dbxref key stuff at
then? Don't set dbxrefs at all, just attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this was meant to be on a different line? That change tracks all the way back to #686. I'd want to get @ryan-williams' thoughts on that code before we yanked it out or changed it. I think we're getting into a debate that's somewhat orthogonal to this PR, which is what exactly the .interval_list
format is. I've just added code to said parser so that it passes round trip tests. If we'd like to change the parser, I think we should open an issue and discuss that there, and not block this PR on it. That's just my opinion though. Perhaps let's discuss on the call tomorrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's create an issue to 1) look into using SAM header code when reading IntervalList format and 2) clean up this dbxref key stuff
Reviewing meow |
Test PASSed. |
Created a new issue #1152 to resolve IntervalList discussions later. Is this pull request ready to merge then? |
f21ead5
to
79749ae
Compare
Just rebased. Should be good for a merge once tests pass. |
Factor single file save code out of AlignmentRecordRDD. Resolves bigdatagenomics#1009.
…List. Resolves bigdatagenomics#1059. Factored out SAM Header writing code, used it to implement single file save for IntervalList files.
Resolves bigdatagenomics#1058. Added `asSingleFile` to all legacy Feature format save methods in `org.bdgenomics.adam.rdd.features.FeatureRDD`. Renamed `Features2ADAM` to `TransformFeatures` and added the `-single` option, as well as the ability to export back to legacy feature formats.
Resolves bigdatagenomics#676. In bigdatagenomics#964, we resolved the "header not set" issues for single file SAM/BAM output. This change propegates this fix to sharded SAM/BAM output, and VCF.
Resolves bigdatagenomics#1106. Improves round trip coverage by checking feature counts after writing out, and spot checking converted feature lines. * Split out all string encoding functions for features to make them easier to test, and wrote tests to spot check the correctness of these features. * Made end-to-end feature tests pass. * Eliminated some redundant tests. * Added assertions to check that the same number of features were present in the input and output files when writing between different formats. * Rewrote interval list parser and encoder. * Fixed various bugs: * GFF2/GTF/GFF3 parser would discard lines that lacked an attribute column. * NarrowPeak parser would discard INDEPENDENT strand. * NarrowPeak signal intensities should be written as Ints, not Floats.
Test PASSed. |
Merged via multiple commits. Thank you, @fnothaft! |
Resolves #1009, #1059, #1058, #676.