
Segmentation fault when using biscuit align with -p option for interleaved input fastq file #19

Closed
TrueScience opened this issue Apr 27, 2022 · 21 comments

@TrueScience

TrueScience commented Apr 27, 2022

$ biscuit align -MC -k 12 -t 6 -p hg19.fa BST_50_1_CRC_S4_R1_R2.fastq > BST_50_1_CRC_S4.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 428404 sequences (60000236 bp)...
[bseq_classify] 0 SE sequences; 428404 PE sequences
[M::process] read 428114 sequences (60000124 bp)...
[M::mem_pestat] # candidate unique pairs: 98565
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.11, 22.07)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428404 reads in 965.734 CPU sec, 161.535 real sec
[bseq_classify] 0 SE sequences; 428114 PE sequences
[M::process] read 428398 sequences (60000072 bp)...
[M::mem_pestat] # candidate unique pairs: 98816
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.06, 21.88)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428114 reads in 995.806 CPU sec, 166.452 real sec
[bseq_classify] 0 SE sequences; 428398 PE sequences
[M::process] read 428288 sequences (60000164 bp)...
[M::mem_pestat] # candidate unique pairs: 98667
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.07, 21.94)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428398 reads in 1043.869 CPU sec, 174.309 real sec
Segmentation fault

BISCUIT version
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 0.3.8.20180515

Is this segfault issue fixed in newer versions of Biscuit?

@jamorrison

Hi @TrueScience,

Without data to recreate your issue, I'm not sure. The newest version of BISCUIT is v1.0.2, so if you try that and find the segfault is still there, I can take a look into what's going on.

@TrueScience

Hi @jamorrison ,

We downloaded the newest version, 1.0.2, and ran biscuit align with the -p option, and still got the segmentation fault. The details are below:

Command:
biscuit align -MC -k 12 -@ 6 -p /local_disk0/hg19ART/hg19.fa /local_disk0/BST_50_1_CRC_S4_R1_R2.fastq > /local_disk0/BST_50_1_CRC_S4.bam

Results:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 428404 sequences (60000236 bp)...
[bseq_classify] 0 SE sequences; 428404 PE sequences
[M::process] read 428114 sequences (60000124 bp)...
[M::mem_pestat] # candidate unique pairs: 97132
[M::mem_pestat] (25, 50, 75) percentile: (289, 307, 317)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (233, 373)
[M::mem_pestat] mean and std.dev: (303.56, 22.74)
[M::mem_pestat] low and high boundaries for proper pairs: (205, 401)
[M::mem_process_seqs] Processed 428404 reads in 3812.077 CPU sec, 636.887 real sec
[bseq_classify] 0 SE sequences; 428114 PE sequences
[M::process] read 428398 sequences (60000072 bp)...
[M::mem_pestat] # candidate unique pairs: 97517
[M::mem_pestat] (25, 50, 75) percentile: (289, 307, 317)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (233, 373)
[M::mem_pestat] mean and std.dev: (303.63, 22.49)
[M::mem_pestat] low and high boundaries for proper pairs: (205, 401)
[M::mem_process_seqs] Processed 428114 reads in 3711.740 CPU sec, 618.983 real sec
[bseq_classify] 0 SE sequences; 428398 PE sequences
[M::process] read 428288 sequences (60000164 bp)...
[M::mem_pestat] # candidate unique pairs: 97301
[M::mem_pestat] (25, 50, 75) percentile: (289, 307, 317)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (233, 373)
[M::mem_pestat] mean and std.dev: (303.52, 22.64)
[M::mem_pestat] low and high boundaries for proper pairs: (205, 401)
[M::mem_process_seqs] Processed 428398 reads in 3763.686 CPU sec, 627.720 real sec
/bin/bash: line 1: 3914 Segmentation fault (core dumped) biscuit align -MC -k 12 -@ 6 -p /local_disk0/hg19ART/hg19.fa /local_disk0/BST_50_1_CRC_S4_R1_R2.fastq > /local_disk0/BST_50_1_CRC_S4.bam

Version:
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 1.0.2.20220113

@jamorrison

Thanks for checking with the updated version. Can you send a small FASTQ that reproduces these results so that I can do my testing?

Thanks!

@TrueScience

Hi @jamorrison , the small fastq.gz file can't be attached here. How can I get the file to you?

@jamorrison

Emailing it to me should work: jacob.morrison@vai.org

@TrueScience

That is awesome. Thanks!

@jamorrison

I've had a chance to investigate this issue. There seems to be a memory leak associated with the -p option that needs to be fixed.

In the meantime, you could try splitting your interleaved FASTQ into separate read 1 and read 2 FASTQs and aligning that way; using two FASTQs rather than an interleaved file is our suggested approach anyway. I was able to align your FASTQ successfully after splitting it into read 1 and read 2 files, even though I could recreate your segfault with -p and the single interleaved FASTQ.
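One way to do that split (a minimal sketch, assuming standard 4-line FASTQ records with read 1 and read 2 strictly alternating; the file names are placeholders) is with awk:

```shell
# Demo input: two read pairs, interleaved (R1, R2, R1, R2), 4 lines per record.
printf '@p1/1\nACGT\n+\nIIII\n@p1/2\nTGCA\n+\nIIII\n' >  interleaved.fastq
printf '@p2/1\nAACC\n+\nIIII\n@p2/2\nGGTT\n+\nIIII\n' >> interleaved.fastq

# Records alternate R1/R2, so lines cycle in blocks of 8:
# lines 0-3 of each block belong to read 1, lines 4-7 to read 2.
awk '((NR-1) % 8) <  4' interleaved.fastq > read1.fastq
awk '((NR-1) % 8) >= 4' interleaved.fastq > read2.fastq
```

This only works if pairs strictly alternate with no orphan reads; a FASTQ-aware tool is safer for messy inputs.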

Also, one suggestion: BISCUIT outputs SAM format. If you want the output in BAM format, you will have to pipe the output to samtools for compression.
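For example, a command template along those lines (paths, thread counts, and file names are placeholders; this assumes samtools is installed):

```shell
# BISCUIT writes SAM to stdout; pipe through samtools to get a sorted BAM.
biscuit align -@ 6 /path/to/hg19.fa read1.fastq read2.fastq \
    | samtools sort -@ 4 -o output.bam -
samtools index output.bam
```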

@TrueScience

Hi @jamorrison,

Thanks a lot for testing it out. In our case we have to use the interleaved FASTQ with the "-p" option as input. Please let us know when the issue is fixed; we are eagerly waiting.

jamorrison added a commit that referenced this issue May 4, 2022
This should fix the problem seen in Issue #19
@jamorrison

Hi @TrueScience,

Okay, I think this issue should be fixed in commit e5bc9ed. I've verified that the command you ran with your test file works and produces the same output as a separated pair of FASTQs. Let me know if this doesn't work, though.

@TrueScience

Thank you so much for your quick response! We will test it asap.

@TrueScience

Hi @jamorrison,

Have you updated the binary after commit e5bc9ed, or do we have to build it from source? Thanks.

@TrueScience

TrueScience commented May 5, 2022

Hi @jamorrison,

We have tested version 1.0.3.dev with -p for interleaved FASTQ input and got the expected results! No segfault or any other errors. Thank you so much for fixing this issue so quickly!

@jamorrison

Awesome, glad it worked! This fix will be in the next release of BISCUIT. Thanks for drawing my attention to it.

@TrueScience

TrueScience commented Feb 16, 2023 via email

@jamorrison

Hi @TrueScience,

The publication is currently in progress and should be ready soon. As a general overview though, BISCUIT is based on the bwa-mem alignment algorithm, with some adjustments to account for C>T and G>A conversions due to bisulfite conversion in WGBS.

@TrueScience

TrueScience commented Oct 31, 2023 via email

@jamorrison

jamorrison commented Nov 1, 2023

Hi Ian,

In principle, I think you could align non-bisulfite-converted datasets. However, because BISCUIT is tolerant of C (ref) → T (read) mismatches, high quality data may yield an increased number of secondary alignments. For FFPE and ancient DNA, with their elevated C→T rates, BISCUIT may actually be beneficial because of that C→T tolerance. That said, for high quality datasets you'd probably be better off aligning with a tool designed for non-converted data (bwa, bwa-mem2, minimap2 depending on use case, bowtie2, etc.).

The BISCUIT index has three main components: a 4-base packed reference and two Burrows-Wheeler transformed genomes with spaced FM-indexes. The indexes are both concatenations of the forward and reverse strands of the reference, but one index is entirely C→T converted and the other is entirely G→A converted. No index is created for the native 4-base reference (i.e., an index where no conversion has occurred).

Initial candidate locations for alignment are found by in silico bisulfite converting substrings of the read and finding exact matches of these "seeds" in the indexes. Locations with exact matches are filtered, merged together based on genomic proximity, and then scored.

Rather than using the in silico converted reads for scoring, BISCUIT uses an asymmetric scoring scheme against the 4-base reference. This scoring scheme allows T's (A's) in the read to align to a C or T (G or A) in the reference (i.e., not scored as a mismatch), but the reverse is scored as a mismatch (C in read cannot align to T in reference).
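As a toy sketch of that asymmetry (this is not BISCUIT's actual code, just an illustration of the rule for the C→T converted strand; the function name is made up):

```shell
# Asymmetric per-base rule for the C->T (OT/CTOT) strand:
# a read T matches reference C or T, but a read C matches only reference C.
score_base_ct() {  # usage: score_base_ct READ_BASE REF_BASE -> match/mismatch
  if [ "$1" = "$2" ]; then
    echo match                                  # identical bases always match
  elif [ "$1" = "T" ] && [ "$2" = "C" ]; then
    echo match                                  # T over C: tolerated (unmethylated C)
  else
    echo mismatch                               # e.g. C over T: penalized
  fi
}
```

So `score_base_ct T C` prints `match` while `score_base_ct C T` prints `mismatch`; the G→A strand is the mirror image, with a read A tolerated over a reference G.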

Hopefully this helps answer your question. Let me know if you need more clarification.

As for the BISCUIT publication, we're still working our way through revisions.

@TrueScience

TrueScience commented Nov 1, 2023 via email

@jamorrison

As a clarification, scoring is done against an unconverted reference, so a methylated C in an OT/CTOT read (the strands with C→T conversions) will be scored as a match against a C in the reference, but as a mismatch against a T. On the other hand, a T in a read will be scored as a match against either a C or a T, since the reference is unconverted and the T could come from either an unmethylated C or an original T.

With respect to alignment versus mapping quality scores, the alignment score is a main component of the mapping quality score, so lower alignment scores will impact the mapping quality score.

My thinking on the increased number of secondary alignments relates to T's being scored as matches against both C and T in the reference. Because C→T is the most common SNP and BISCUIT will map the resulting T to both C and T, you could end up with more reads that score identically at multiple locations with similar sequences. That wouldn't necessarily affect your alignment score, but it would hurt your mapping quality score, since having multiple locations with the same alignment score severely penalizes mapping quality.

I will say that this is more of a gut feeling, and I can think of a couple of counterarguments off the top of my head as to why it might be wrong (a single C→T SNP in a read won't necessarily produce more secondary mappings, and unconverted data has more Cs in reads that can only map to Cs in the reference (i.e., scored as a match), which restricts where your reads can map). It would be an interesting experiment to map an unconverted dataset with both BISCUIT and bwa-mem and compare the alignments to see the impact on alignment locations, alignment scores, and mapping quality scores. That would probably give you the best idea of whether aligning unconverted datasets with BISCUIT is a viable option. (And I'd be interested to hear the results of that comparison if you do end up trying it.)

@TrueScience

TrueScience commented Nov 6, 2023 via email

@jamorrison

Thanks for being willing to keep me posted!

My understanding of bwa mem (and therefore BISCUIT) is that you can't turn off soft-clipping, even with a very high value for -L. I think this is due to the algorithm used to score alignments (see this bwa issue and the referenced minimap2 issue).
