BAM-sorting puts unmapped reads before reads mapped to lower-case-named contigs #799

Closed
ryan-williams opened this issue Aug 23, 2015 · 1 comment · Fixed by #803
@ryan-williams
Member

Kind of a silly thing I noticed: we sort unmapped reads as if they came from a contig whose name starts with "ZZZ", but "Z" < "a" in ASCII order, so mapped reads on lower-case-named contigs such as chr1 end up after the unmapped reads.
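
To illustrate the ordering problem in isolation, here is a minimal Python sketch of plain lexicographic string comparison. It is not ADAM code: the `UNMAPPED_KEY` name and the list of contig names are assumptions made up for the example, with "ZZZ" standing in for whatever placeholder contig name the unmapped reads are sorted under.

```python
# Hypothetical placeholder name under which unmapped reads are sorted.
UNMAPPED_KEY = "ZZZ"

# A mix of upper-case, numeric, and lower-case contig names (illustrative only).
contigs = ["chr22", "22", "X", UNMAPPED_KEY]

# Plain lexicographic sort: digits < upper-case letters < lower-case letters.
ordered = sorted(contigs)
print(ordered)

# The placeholder lands before any contig whose name starts with a lower-case letter.
print(ordered.index(UNMAPPED_KEY) < ordered.index("chr22"))
```

With names like `22` or `X` the placeholder sorts last as intended; it only breaks once a contig name begins with a character that compares greater than "Z".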

Repro:

# Replace "\t22\t" with "\tchr22\t", and "SN:22" with "SN:chr22" in bqsr1.sam
$ perl -pe 's/\t22\t/\tchr22\t/;s/SN:22/SN:chr22/' adam-cli/src/test/resources/bqsr1.sam > bqsr-chr22.sam

# Sort the resulting SAM file
$ bin/adam-submit transform -single -sort_reads bqsr-chr22.sam bqsr-chr22-sorted.sam

# Note that the first reads are now the 8 unmapped reads
$ samtools view bqsr-chr22-sorted.sam | head
SRR062634.10448889  101 *   0   0   *   chr22   16079761    0   TTTCTTTCTTTTATATATATATACACACACACACACACACACACACATATATGTATATATACACGTATATGTATGTATATATGTATATATACACGTATAT    @DF>C;FDC=EGEGGEFDGEFDD?DFDEEGFGFGGGDGGGGGGGEGGGGFGGGFGGGGGGFGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG    LB:Z:2845856850 RG:Z:SRR062634
SRR062634.16769670  101 *   0   0   *   chr22   16081625    0   GGTGAGATGATTGCTGGGATTACAGGCGTGAGCCACCGCGCCTGGCCGTATGTTTATTCTTATGATAGTACCATACTGTTTTGTAGTATGTTTTATAGCT    CCCCBCC:CA:CDCBDCBCCA?C@ABBCCBADFECFGGGGGGGGEGDGGGGGGGGFDGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG    LB:Z:2845856850 RG:Z:SRR062634
SRR062634.17698657  101 *   0   0   *   chr22   16083363    0   ATAACATTTGGTCTATAGCATCAGAGCCTTATGCACTCAGAGGAAATCCAAAATCACCTATAAGTATTTGCTGGTCCCCTCTGGGCTTAGGGAAATCTCT    AAB8AA@A=C>=>>:=:5=;5;*83:=??@7:21+88748:@CBBC:CCC<<>17,>C>>@@C??>>A???A:GGGGGGGEGGGGGGGGGGGGGGGGGGG    LB:Z:2845856850 RG:Z:SRR062634
SRR062634.17969132  133 *   0   0   *   chr22   16096818    0   TGGGCCCCGTGGACCCTGGCGACCCCCGGGGGAGGCCCGGGGGGGCCCCCTGGCCCCCAAGGGGGGGCCCCAACGGGGAGAAGGGTCCCTAGGGGGGGGG    GGGGGGGGGFGGGGGGEEGEAA<@A###########################################################################    LB:Z:2845856850 XC:i:35 RG:Z:SRR062634
SRR062634.18958430  133 *   0   0   *   chr22   16076921    0   TTTTTTCCTGTCTTGGTTGTATAAAAAAAGAGGGAGAAACGCCTGGCAGGGCACCCCAACAAAGGAAGGGAGGAGGGGGTCCCAAGGGGGCCCCGCGGGA    ####################################################################################################    LB:Z:2845856850 XC:i:35 RG:Z:SRR062634
SRR062634.20911784  133 *   0   0   *   chr22   16060584    0   TGTAGTGGCAGGGGCCCGTTATCCCAAACTACCTGGGGGGGGGGGGGGGGGGGAACACCTAAAACCCGGGGGGGGGGGGGTTGGTGGGGGCTTTATCGCA    GGGGGGGDGG@#########################################################################################    LB:Z:2845856850 XC:i:35 RG:Z:SRR062634
SRR062634.4789722   133 *   0   0   *   chr22   16071485    0   ATAATATAATGAAAAATAGAGACTGAGAGAGAAAAAACAAGACCCTTTATCTTTATATTTTTTCATATGTGTTTTTTTTTCTGTCTGGTTTTTTTGTTTT    ####################################################################################################    LB:Z:2845856850 XC:i:35 RG:Z:SRR062634
SRR062634.9119161   133 *   0   0   *   chr22   16062369    0   TTATATGTGTTTTTAAACTAAACTAATTTTATAGGAAAAATAATTTCTTTCCTTCCCTGTTATATCAAATACAGCCTTTAGCTCAAGACACAAGTAATTC    CCAC?BD?DCCD:CCACD5=DCDAB=AA=5DBC=BC:??BB85:CBBB<7<BBBBBB?A?BBB5BB:BBBB?B?BBB@??:=0====::'>;'25?????    LB:Z:2845856850 RG:Z:SRR062634
SRR062634.20563591  99  chr22   16050128    0   100M    =   16050189    161 GGACAACATTCACCTTTAAAAGTTTATTGATCTTTTGTGACATGCACGTGGGTTCCCAGTAGCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATT    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGFGGGGGGGFGGGGGEGGGGGGEFGFGGFGGGGFBGGGDE>EEECBCCEEC>E:A@=ADBCAB    X0:i:2  X1:i:0  XA:Z:14,-19792774,100M,0;   LB:Z:2845856850 MD:Z:100    RG:Z:SRR062634  XG:i:0  AM:i:0  NM:i:0  SM:i:0  XM:i:0  XO:i:0  XT:A:R
SRR062634.20563591  147 chr22   16050189    0   100M    =   16050128    -161    GCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATTTCTTTAATTCCAAGACAGTCTCAAATATTTTCTTATTAACTTCCTGGAGGGAGGCTTATCA    =DDFECDCFEDFEFFFEEFD:EF?FBFFFFBACEF?EGGGEDGGBGEGFGGGGGGFGGGGGGGGGFGGGGFFGGGGGGGFGGGGGGGGGGGGGGGGGGGG    X0:i:2  X1:i:0  XA:Z:14,+19792713,100M,0;   LB:Z:2845856850 MD:Z:100    RG:Z:SRR062634  XG:i:0  AM:i:0  NM:i:0  SM:i:0  XM:i:0  XO:i:0  XT:A:R
@fnothaft
Member

Ah yeah, the back story on the ZZZ prefix is that it was a fix (#624) for some nasty straggler issues that we were seeing. Changing ZZZ to zzz would be a simple fix ;), although yes, it'd still be a hack.
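
A rough sketch of why a lower-case prefix would still be a hack, and of one possible alternative: a name-based placeholder can always be out-sorted by some contig name, whereas an explicit "is unmapped" flag at the front of the sort key cannot. This is a Python illustration under assumptions, not the change that was merged: `sort_key`, the `(contig, position)` tuples, and the `~odd` contig name are all hypothetical, and the sketch does not try to address the straggler problem that motivated the original prefix.

```python
def sort_key(contig, position):
    """Build a sort key; contig is None for an unmapped read (sketch assumption)."""
    unmapped = contig is None
    # False sorts before True, so every mapped read precedes every unmapped read,
    # regardless of how the contig is named.
    return (unmapped, contig or "", position)

# Hypothetical reads as (contig, position) pairs.
reads = [
    (None, 0),
    ("chr22", 16050189),
    ("22", 5),
    ("chr22", 16050128),
    ("~odd", 1),
]

ordered = sorted(reads, key=lambda r: sort_key(*r))
for read in ordered:
    print(read)

# A "zzz" placeholder would still sort before a contig named "~odd".
print("zzz" < "~odd")
```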
