Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INDEL realigner cleanup #1412

Merged

Conversation

fnothaft
Copy link
Member

@fnothaft fnothaft commented Mar 2, 2017

Fixes #1402, and provides many additional improvements to the IR core. Still a WIP, testing on the cluster as we speak.

@coveralls
Copy link

coveralls commented Mar 2, 2017

Coverage Status

Coverage increased (+0.7%) to 77.008% when pulling 3cf32f6 on fnothaft:issues/1402-indel-realigner-what into eb4aa6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1823/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 2, 2017

Coverage Status

Coverage increased (+1.0%) to 77.3% when pulling 31c68fa on fnothaft:issues/1402-indel-realigner-what into eb4aa6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1825/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 3, 2017

Coverage Status

Coverage increased (+0.9%) to 77.174% when pulling fe8a2f2 on fnothaft:issues/1402-indel-realigner-what into eb4aa6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1829/
Test PASSed.

Copy link
Member

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor doc changes

@@ -77,6 +77,10 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
var lodThreshold = 5.0
@Args4jOption(required = false, name = "-max_target_size", usage = "The maximum length of a target region to attempt realigning. Default length is 3000.")
var maxTargetSize = 3000
@Args4jOption(required = false, name = "-max_reads_per_target", usage = "The maximum number of reads attached to a target that we will attempt realigning. Default length is 20000.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about The maximum number of reads attached to a target considered for realignment. Default is 20000.

@@ -421,20 +435,30 @@ private[read] class RealignIndels(
* @param read Read to test.
* @param reference Reference sequence to sweep across.
* @param qualities Integer sequence of phred scaled base quality scores.
* @param originalQuality
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs doc comment

@fnothaft fnothaft added this to the 0.21.1 milestone Mar 3, 2017
@@ -208,7 +208,9 @@ object MdTag {

// dirty dancing to recalculate match sets
for (i <- 0 until cigarElement.getLength) {
if (reference(referencePos) == sequence(readPos)) {
println(reference + " @ " + referencePos + ", " + sequence + " @ " + readPos)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1834/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1412/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b8c4222ca4c78a234f74dfd72089d9343e62fa58 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1412/merge^{commit} # timeout=10Checking out Revision b8c4222ca4c78a234f74dfd72089d9343e62fa58 (origin/pr/1412/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b8c4222ca4c78a234f74dfd72089d9343e62fa58First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the issues/1402-indel-realigner-what branch from e533c90 to 745df4f Compare March 4, 2017 05:32
@coveralls
Copy link

coveralls commented Mar 4, 2017

Coverage Status

Coverage increased (+0.8%) to 77.192% when pulling 745df4f on fnothaft:issues/1402-indel-realigner-what into 07c1982 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1835/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1837/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1412/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 6d408c4 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1412/merge^{commit} # timeout=10Checking out Revision 6d408c4 (origin/pr/1412/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 6d408c471459434c79630e5724f1a67c964b88cbFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

coveralls commented Mar 15, 2017

Coverage Status

Coverage increased (+1.0%) to 77.466% when pulling 8a63091 on fnothaft:issues/1402-indel-realigner-what into cf39e6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1877/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1412/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 0761d92 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1412/merge^{commit} # timeout=10Checking out Revision 0761d92 (origin/pr/1412/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 0761d9214029fa634f56bc4a72ea7a61e7d29d71First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

Jenkins, retest this please.

@coveralls
Copy link

coveralls commented Mar 15, 2017

Coverage Status

Coverage increased (+1.0%) to 77.466% when pulling 8a63091 on fnothaft:issues/1402-indel-realigner-what into cf39e6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1879/
Test PASSed.

@fnothaft fnothaft force-pushed the issues/1402-indel-realigner-what branch from 8a63091 to 80ad790 Compare March 23, 2017 04:34
@coveralls
Copy link

coveralls commented Mar 23, 2017

Coverage Status

Coverage increased (+0.8%) to 81.35% when pulling 80ad790 on fnothaft:issues/1402-indel-realigner-what into 629b778 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1906/
Test PASSed.

@fnothaft
Copy link
Member Author

Thanks @heuermh! Glad to have things cleaned up.

@coveralls
Copy link

coveralls commented Mar 26, 2017

Coverage Status

Coverage increased (+0.6%) to 81.226% when pulling 54f328f on fnothaft:issues/1402-indel-realigner-what into 629b778 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1912/
Test PASSed.

@fnothaft fnothaft force-pushed the issues/1402-indel-realigner-what branch from 54f328f to e6376e8 Compare March 28, 2017 18:52
@coveralls
Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage increased (+0.6%) to 81.439% when pulling e6376e8 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1915/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage increased (+0.7%) to 81.541% when pulling 99f4673 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1916/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 29, 2017

Coverage Status

Coverage increased (+0.7%) to 81.523% when pulling f82fe65 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1917/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Mar 29, 2017

Don't unclip. Yes, or do I mean no?

@fnothaft fnothaft force-pushed the issues/1402-indel-realigner-what branch from f82fe65 to 2c868e1 Compare March 30, 2017 06:44
@fnothaft
Copy link
Member Author

@heuermh think this is good to go now; rerunning a final verification run. Can you give the second commit a looksee? That should be everything new since your last pass.

@coveralls
Copy link

coveralls commented Mar 30, 2017

Coverage Status

Coverage increased (+0.8%) to 81.652% when pulling 2c868e1 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1918/
Test PASSed.

val startClipped = richCigar.softClippedBasesAtStart
val endClipped = richCigar.softClippedBasesAtEnd

if (unclipReads ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting hard to follow the double negative here. Maybe I need more sleep. :)

builder.setOldPosition(r.getStart())
builder.setOldCigar(r.getCigar())
val rec = builder.build()
if (r.getReadName == "H06JUADXX130110:1:1108:12070:36897") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wot?

} else {
r
}
} catch {
case t: Throwable => {
log.warn("Realigning read %s failed with %s.".format(r, t))
log.warn("Realigning read %s failed with %s. At:".format(r, t))
val stack = t.getStackTrace()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.warn(String, Throwable)?

* @param remappingIdx The location in the consensus sequence where this
* read has been realigned.
* @param consensus The consensus variant to realign against.
* @peturn Returns a tuple with the (read start, read end, cigar).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peturn@return

}

// if read is fully before or fully after the consensus, then we
// have a full match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the comments here

@coveralls
Copy link

coveralls commented Mar 30, 2017

Coverage Status

Coverage increased (+0.9%) to 81.716% when pulling 106da85 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1919/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 31, 2017

Coverage Status

Coverage increased (+0.7%) to 81.586% when pulling 3efad26 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1921/
Test PASSed.

Resolves bigdatagenomics#1402. Includes fixes to consensus generator and reference scorer.

Improve INDEL realigner performance:

* Exit early when realigning will not yield a better score.
* Eliminate substring call in sweep over reference.
* Change datastructures to be immutable wherever possible.
* Add bound checking and other error checking.
* Rewrite target association code to use array instead of set, and improve load balancing.
* Delete high coverage targets with reduceByKey.

Additionally:
* Improve telemetry/logging to sort out load balance issue.
* Support using reference file in INDEL realignment.
* Log reads with negative alignment sizes.
* Improved test coverage for insertion realignment.
* Fix CIGARs on reads that partially overlap INDEL.
* Soft clip reads that partially align to an insertion.
* Eliminate non-determinism.
* Fixed reference file.
* Serialization fixes and debug.
* Fix bad score.
* Clean up clipping code?
* Unclip clipped reads.
@fnothaft fnothaft force-pushed the issues/1402-indel-realigner-what branch from 3efad26 to b6de9a5 Compare March 31, 2017 07:11
@fnothaft
Copy link
Member Author

Squashed down; I say this is good to go from my end.

@coveralls
Copy link

coveralls commented Mar 31, 2017

Coverage Status

Coverage increased (+0.7%) to 81.586% when pulling b6de9a5 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1922/
Test PASSed.

@heuermh heuermh merged commit d12fdfb into bigdatagenomics:master Mar 31, 2017
@heuermh
Copy link
Member

heuermh commented Mar 31, 2017

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants