
Segmentation fault when using biscuit align with -p option for interleaved input fastq file #19

Closed
TrueScience opened this issue Apr 27, 2022 · 21 comments

@TrueScience

TrueScience commented Apr 27, 2022

$ biscuit align -MC -k 12 -t 6 -p hg19.fa BST_50_1_CRC_S4_R1_R2.fastq > BST_50_1_CRC_S4.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 428404 sequences (60000236 bp)...
[bseq_classify] 0 SE sequences; 428404 PE sequences
[M::process] read 428114 sequences (60000124 bp)...
[M::mem_pestat] # candidate unique pairs: 98565
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.11, 22.07)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428404 reads in 965.734 CPU sec, 161.535 real sec
[bseq_classify] 0 SE sequences; 428114 PE sequences
[M::process] read 428398 sequences (60000072 bp)...
[M::mem_pestat] # candidate unique pairs: 98816
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.06, 21.88)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428114 reads in 995.806 CPU sec, 166.452 real sec
[bseq_classify] 0 SE sequences; 428398 PE sequences
[M::process] read 428288 sequences (60000164 bp)...
[M::mem_pestat] # candidate unique pairs: 98667
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.07, 21.94)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428398 reads in 1043.869 CPU sec, 174.309 real sec
Segmentation fault

BISCUIT version
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 0.3.8.20180515

Is this segfault issue fixed in newer versions of Biscuit?

@jamorrison

Hi @TrueScience,

Without data to recreate your issue, I'm not sure. The newest version of BISCUIT is v1.0.2, so if you try that and find the segfault is still there, I can take a look into what's going on.

@TrueScience

Hi @jamorrison ,

We downloaded the newest version, 1.0.2, and ran biscuit align with the -p option, and still got the segmentation fault. The details are below:

Command:
biscuit align -MC -k 12 -@ 6 -p /local_disk0/hg19ART/hg19.fa /local_disk0/BST_50_1_CRC_S4_R1_R2.fastq > /local_disk0/BST_50_1_CRC_S4.bam

Results:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 428404 sequences (60000236 bp)...
[bseq_classify] 0 SE sequences; 428404 PE sequences
[M::process] read 428114 sequences (60000124 bp)...
[M::mem_pestat] # candidate unique pairs: 97132
[M::mem_pestat] (25, 50, 75) percentile: (289, 307, 317)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (233, 373)
[M::mem_pestat] mean and std.dev: (303.56, 22.74)
[M::mem_pestat] low and high boundaries for proper pairs: (205, 401)
[M::mem_process_seqs] Processed 428404 reads in 3812.077 CPU sec, 636.887 real sec
[bseq_classify] 0 SE sequences; 428114 PE sequences
[M::process] read 428398 sequences (60000072 bp)...
[M::mem_pestat] # candidate unique pairs: 97517
[M::mem_pestat] (25, 50, 75) percentile: (289, 307, 317)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (233, 373)
[M::mem_pestat] mean and std.dev: (303.63, 22.49)
[M::mem_pestat] low and high boundaries for proper pairs: (205, 401)
[M::mem_process_seqs] Processed 428114 reads in 3711.740 CPU sec, 618.983 real sec
[bseq_classify] 0 SE sequences; 428398 PE sequences
[M::process] read 428288 sequences (60000164 bp)...
[M::mem_pestat] # candidate unique pairs: 97301
[M::mem_pestat] (25, 50, 75) percentile: (289, 307, 317)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (233, 373)
[M::mem_pestat] mean and std.dev: (303.52, 22.64)
[M::mem_pestat] low and high boundaries for proper pairs: (205, 401)
[M::mem_process_seqs] Processed 428398 reads in 3763.686 CPU sec, 627.720 real sec
/bin/bash: line 1: 3914 Segmentation fault (core dumped) biscuit align -MC -k 12 -@ 6 -p /local_disk0/hg19ART/hg19.fa /local_disk0/BST_50_1_CRC_S4_R1_R2.fastq > /local_disk0/BST_50_1_CRC_S4.bam

Version:
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 1.0.2.20220113

@jamorrison

Thanks for checking with the updated version. Can you send a small FASTQ that reproduces these results so that I can do my testing?

Thanks!

@TrueScience

Hi @jamorrison , the small fastq.gz file can't be attached here. How can I get the file to you?

@jamorrison

Emailing it to me should work: jacob.morrison@vai.org

@TrueScience

That is awesome. Thanks!

@jamorrison

I've had a chance to investigate this issue. There seems to be a memory leak associated with the -p option that needs to be fixed.

In the meantime, you could try splitting your interleaved FASTQ into separate read 1 and read 2 FASTQs and aligning that way; using two FASTQs rather than an interleaved file is our suggested approach anyway. I was able to align your FASTQ successfully after splitting it into read 1 and read 2 files, even though I could recreate your segfault with -p and the single interleaved FASTQ.
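One way to do that split (a minimal sketch, assuming standard 4-line FASTQ records with read 1 and read 2 strictly alternating; the file names are placeholders) is with awk:

```shell
# Demo input: two read pairs, interleaved (R1, R2, R1, R2), 4 lines per record.
printf '@p1/1\nACGT\n+\nIIII\n@p1/2\nTGCA\n+\nIIII\n' >  interleaved.fastq
printf '@p2/1\nAACC\n+\nIIII\n@p2/2\nGGTT\n+\nIIII\n' >> interleaved.fastq

# Records alternate R1/R2, so lines cycle in blocks of 8:
# lines 0-3 of each block belong to read 1, lines 4-7 to read 2.
awk '((NR-1) % 8) <  4' interleaved.fastq > read1.fastq
awk '((NR-1) % 8) >= 4' interleaved.fastq > read2.fastq
```

This only works if pairs strictly alternate with no orphan reads; a FASTQ-aware tool is safer for messy inputs.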

Also, one suggestion: BISCUIT outputs SAM format. If you want the output in BAM format, you will have to pipe the output to samtools for compression.
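For example, a command template along those lines (paths, thread counts, and file names are placeholders; this assumes samtools is installed):

```shell
# BISCUIT writes SAM to stdout; pipe through samtools to get a sorted BAM.
biscuit align -@ 6 /path/to/hg19.fa read1.fastq read2.fastq \
    | samtools sort -@ 4 -o output.bam -
samtools index output.bam
```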

@TrueScience

Hi @jamorrison,

Thanks a lot for testing it out. In our case we have to use the interleaved FASTQ with the "-p" option as input. Please let us know when the issue is fixed; we are eagerly waiting.

jamorrison added a commit that referenced this issue May 4, 2022
This should fix the problem seen in Issue #19
@jamorrison

Hi @TrueScience,

Okay, I think this issue should be fixed in commit e5bc9ed. I've verified that the command you ran with your test file works and produces the same output as a separated pair of FASTQs. Let me know if this doesn't work, though.

@TrueScience

Thank you so much for your quick response! We will test it asap.

@TrueScience

Hi @jamorrison,

Have you updated the binary after commit e5bc9ed, or do we have to build it from source? Thanks.

@TrueScience

TrueScience commented May 5, 2022

Hi @jamorrison,

We have tested version 1.0.3.dev with -p for interleaved FASTQ input and got the expected results! No segfault or any other errors. Thank you so much for fixing this issue so quickly!

@jamorrison

Awesome, glad it worked! This fix will be in the next release of BISCUIT. Thanks for drawing my attention to it.

@TrueScience

TrueScience commented Feb 16, 2023 via email

@jamorrison

Hi @TrueScience,

The publication is currently in progress and should be ready soon. As a general overview though, BISCUIT is based on the bwa-mem alignment algorithm, with some adjustments to account for C>T and G>A conversions due to bisulfite conversion in WGBS.

@TrueScience

TrueScience commented Oct 31, 2023 via email

@jamorrison

jamorrison commented Nov 1, 2023

Hi Ian,

In principle, I think you could align non-bisulfite-converted datasets. However, because BISCUIT is tolerant of C (ref) → T (read) mismatches, high quality data may yield an increased number of secondary alignments. For FFPE and ancient DNA, with their elevated C→T rates, BISCUIT may actually be beneficial because of that C→T tolerance. That said, for high quality datasets you'd probably be better off aligning with a tool designed for non-converted data (bwa, bwa-mem2, minimap2 depending on use case, bowtie2, etc.).

The BISCUIT index has three main components: a 4-base packed reference and two Burrows-Wheeler transformed genomes with spaced FM-indexes. The indexes are both concatenations of the forward and reverse strands of the reference, but one index is entirely C→T converted and the other is entirely G→A converted. No index is created for the native 4-base reference (i.e., an index where no conversion has occurred).

Initial candidate locations for alignment are found by in silico bisulfite converting substrings of the read and finding exact matches of these "seeds" in the indexes. Locations with exact matches are filtered, merged together based on genomic proximity, and then scored.

Rather than using the in silico converted reads for scoring, BISCUIT uses an asymmetric scoring scheme against the 4-base reference. This scoring scheme allows T's (A's) in the read to align to a C or T (G or A) in the reference (i.e., not scored as a mismatch), but the reverse is scored as a mismatch (C in read cannot align to T in reference).
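As a toy sketch of that asymmetry (this is not BISCUIT's actual code, just an illustration of the rule for the C→T converted strand; the function name is made up):

```shell
# Asymmetric per-base rule for the C->T (OT/CTOT) strand:
# a read T matches reference C or T, but a read C matches only reference C.
score_base_ct() {  # usage: score_base_ct READ_BASE REF_BASE -> match/mismatch
  if [ "$1" = "$2" ]; then
    echo match                                  # identical bases always match
  elif [ "$1" = "T" ] && [ "$2" = "C" ]; then
    echo match                                  # T over C: tolerated (unmethylated C)
  else
    echo mismatch                               # e.g. C over T: penalized
  fi
}
```

So `score_base_ct T C` prints `match` while `score_base_ct C T` prints `mismatch`; the G→A strand is the mirror image, with a read A tolerated over a reference G.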

Hopefully this helps answer your question. Let me know if you need more clarification.

As for the BISCUIT publication, we're still working our way through revisions.

@TrueScience

TrueScience commented Nov 1, 2023 via email

@jamorrison

As a clarification, scoring is done against an unconverted reference, so a methylated C in an OT/CTOT read (the strands with C→T conversions) will be scored as a match against a C in the reference, but as a mismatch against a T. On the other hand, a T in a read will be scored as a match against either a C or a T, since the reference is unconverted and the T could come from either an unmethylated C or an original T.

With respect to alignment versus mapping quality scores, the alignment score is a main component of the mapping quality score, so lower alignment scores will impact the mapping quality score.

My thinking on the increased number of secondary alignments relates to T's being scored as matches against both C and T in the reference. Because C→T is the most common SNP and BISCUIT will map the resulting T to both C and T, you could end up with more reads that score identically at multiple locations with similar sequences. That wouldn't necessarily affect your alignment score, but it would hurt your mapping quality score, since having multiple locations with the same alignment score severely penalizes mapping quality.

I will say that this is more of a gut feeling, and I can think of a couple of counterarguments off the top of my head as to why it might be wrong (a single C→T SNP in a read won't necessarily produce more secondary mappings, and unconverted data has more Cs in reads that can only map to Cs in the reference (i.e., scored as a match), which restricts where your reads can map). It would be an interesting experiment to map an unconverted dataset with both BISCUIT and bwa-mem and compare the alignments to see the impact on alignment locations, alignment scores, and mapping quality scores. That would probably give you the best idea of whether aligning unconverted datasets with BISCUIT is a viable option. (And I'd be interested to hear the results of that comparison if you do end up trying it.)

@TrueScience

TrueScience commented Nov 6, 2023 via email

@jamorrison

Thanks for being willing to keep me posted!

My understanding of bwa mem (and therefore BISCUIT) is that you can't turn off soft-clipping, even with a very high value for -L. I think this is due to the algorithm used to score alignments (see this bwa issue and the referenced minimap2 issue).
