Segmentation fault when using biscuit align with -p option for interleaved input fastq file #19
Hi @TrueScience, Without data to recreate your issue, I'm not sure. The newest version of BISCUIT is v1.0.2, so if you try that and find the segfault is still there, I can take a look into what's going on.
Hi @jamorrison, We downloaded the newest version, 1.0.2, and ran biscuit align with the -p option, and still got the segmentation fault error. The following are the details: Command: Results: Version:
Thanks for checking with the updated version. Can you send a small FASTQ that reproduces these results so that I can do my testing? Thanks!
Hi @jamorrison, the small fastq.gz file can't be attached here. How can I get the file to you?
Emailing it to me should work: jacob.morrison@vai.org
That is awesome. Thanks!
I've had a chance to investigate this issue. There seems to be a memory leak associated with the -p option. In the meantime, you could try splitting your interleaved FASTQ into two FASTQs and aligning that way. Our suggested method is to use two FASTQs (read 1 and read 2) instead of interleaving them. I was able to successfully align your FASTQ after splitting it into read 1 and read 2 FASTQs, even though I could recreate your segfault with the interleaved file. Also, one suggestion: BISCUIT outputs SAM format. If you want the output in BAM format, you will have to pipe the output through samtools.
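The workaround above (splitting an interleaved FASTQ into separate read 1 and read 2 files) can be sketched in a few lines. This is a minimal illustration, not a BISCUIT tool; it assumes a well-formed interleaved file where R1 and R2 records strictly alternate, four lines per record.

```python
# Hedged sketch: de-interleave FASTQ lines into separate R1 and R2 record
# lists, assuming records strictly alternate R1, R2, R1, R2, ...
def deinterleave(fastq_lines):
    # Group the flat line list into 4-line FASTQ records.
    records = [fastq_lines[i:i + 4] for i in range(0, len(fastq_lines), 4)]
    # Even-indexed records are read 1, odd-indexed records are read 2.
    return records[0::2], records[1::2]

interleaved = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/2", "TGCA", "+", "IIII",
]
r1, r2 = deinterleave(interleaved)
```

In practice the two lists would be written to separate R1/R2 files and passed to `biscuit align` as two positional FASTQ arguments instead of one interleaved file with -p.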
Hi @jamorrison, Thanks a lot for testing it out. We have to use an interleaved FASTQ with the "-p" option as the input in our case. When the issue is fixed, please let us know, as we are anxiously waiting.
This should fix the problem seen in Issue #19
Hi @TrueScience, Okay, I think this issue should be fixed in commit e5bc9ed. I've verified that the command you ran with your test file works and produces the same output as a separated-out set of FASTQs. Let me know, though, if this doesn't work.
Thank you so much for your quick response! We will test it asap.
Hi @jamorrison, Have you updated the binary file after the commit e5bc9ed, or do we have to install it from source? Thanks.
Hi @jamorrison, We have tested version 1.0.3.dev with -p for interleaved FASTQ input and got the expected results! No segfault or any other errors. Thank you so much for fixing this issue so quickly!
Awesome. Glad it worked! This fix will be in the next release of BISCUIT. Thanks for drawing my attention to it.
Hi Jacob,
I was wondering if there is a publication that talks about Biscuit,
particularly its strategy used in alignment. I searched it online, but
couldn't find one. Could you please let me know if you have info on this?
Thank you very much!
Ian
Hi @TrueScience, The publication is currently in progress and should be ready soon. As a general overview though, BISCUIT is based on the bwa-mem alignment algorithm, with some adjustments to account for C>T and G>A conversions due to bisulfite conversion in WGBS.
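The C>T and G>A conversions mentioned above can be illustrated with a toy sketch of in silico bisulfite conversion (a simplification for illustration; BISCUIT's actual index construction is in C and operates on the full reference):

```python
# Hedged sketch: full in silico bisulfite conversion of a sequence, as used
# conceptually when building the two converted indexes / read seeds.
def ct_convert(seq):
    """Fully C->T convert a sequence (C->T index / OT-style seeds)."""
    return seq.replace("C", "T")

def ga_convert(seq):
    """Fully G->A convert a sequence (G->A index / OB-style seeds)."""
    return seq.replace("G", "A")
```

For example, `ct_convert("ACGTCC")` yields `"ATGTTT"`: every C becomes a T, regardless of methylation, which is why the converted space is only used for seeding and not for final scoring.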
Hi Jacob,
I hope all is well!
I just have a question about using Biscuit. Can I use Biscuit to align
non-bisulfite converted sequencing datasets? I think in principle it should
be OK, but do you think there are potential issues? When Biscuit aligns
reads to the genome index, will it consider the native genome index first,
followed by C>T and G>A converted indexes? Are C>T and G>A conversions
considered in alignment score calculations?
By the way, if your Biscuit publication is available, please let me know.
Thank you so much!
Ian
Hi Ian,
In principle, I think you could align non-bisulfite converted datasets. However, because BISCUIT is tolerant of C (ref) → T (read) mismatches, you may run into issues with high quality data having an increased number of secondary alignments. For FFPE and ancient DNA, with their elevated C→T rates, BISCUIT may actually be of benefit because of that C→T tolerance. That said, for high quality datasets, you'd probably be better off aligning with a tool designed for non-converted datasets (bwa, bwa-mem2, minimap2 (depending on use case), bowtie2, etc.).
The BISCUIT index has three main components: a 4-base packed reference and two Burrows-Wheeler transformed genomes with spaced FM-indexes. The indexes are both concatenations of the forward and reverse strands of the reference, but one index is entirely C→T converted and the other is entirely G→A converted. No index is created for the native 4-base reference (i.e., an index where no conversion has occurred).
Initial candidate locations for alignment are found by in silico bisulfite converting substrings of the read and finding exact matches of these "seeds" in the indexes. Locations with exact matches are filtered, merged together based on genomic proximity, and then scored.
Rather than using the in silico converted reads for scoring, BISCUIT uses an asymmetric scoring scheme against the 4-base reference. This scoring scheme allows T's (A's) in the read to align to a C or T (G or A) in the reference (i.e., not scored as a mismatch), but the reverse is scored as a mismatch (a C in the read cannot align to a T in the reference).
Hopefully this helps answer your question. Let me know if you need more clarification. As for the BISCUIT publication, we're still working our way through revisions.
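The asymmetric scoring rule described above can be sketched per base. This is an illustration of the rule for the C→T converted strands only, not BISCUIT's actual code, and the score values are made up:

```python
# Hedged sketch: asymmetric bisulfite-aware base scoring (C->T strands).
MATCH, MISMATCH = 1, -2  # illustrative values, not BISCUIT's parameters

def score_base(read_base, ref_base):
    # A T in the read matches either C or T in the reference: the T may
    # come from an unmethylated C or from an original T.
    if read_base == "T" and ref_base in ("C", "T"):
        return MATCH
    # The reverse is not tolerated: a C in the read only matches a C.
    return MATCH if read_base == ref_base else MISMATCH
```

Note the asymmetry: `score_base("T", "C")` is a match, while `score_base("C", "T")` is a mismatch. The G→A strands would use the mirrored rule (read A matching reference G or A).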
Hi Jacob,
Thanks a lot for your quick response! It is extremely helpful! We sometimes
have BST-converted samples and native samples in the same dataset. We feel
like it would be convenient to use Biscuit for both cases. But your point
is well taken.
"This scoring scheme allows T's (A's) in the read to align to a C or T (G
or A) in the reference (i.e., not scored as a mismatch), but the reverse
is scored as a mismatch (C in read cannot align to T in reference)." Does
this mean that a methylated C in a forward read will be counted as a
mismatch and take a penalty in alignment score, since the reference is
always T? I would assume that the score here only affects the
alignment score, not the read mapping quality. If this is the case, I can
see that a native DNA sequence (i.e. not treated with BST) will have a
lower alignment score across the board, but expected or un-affected mapping
quality. Do you think this is an accurate statement?
Best regards,
Ian
As a clarification, the scoring is done against an unconverted reference, so a methylated C in an OT/CTOT read (the strands with C→T conversions) will be scored as a match against a C in the reference, but as a mismatch against a T in the reference. On the other hand, a T in a read will be scored as a match against either a C or a T, since the reference is unconverted and the T may come from either an unmethylated C or an original T.
With respect to alignment versus mapping quality scores, the alignment score is a main component of the mapping quality score, so lower alignment scores will impact the mapping quality score.
My thought about the number of secondary alignments increasing is related to T's being scored as matches against both C and T in the reference. Because C→T is the most common SNP and BISCUIT will map the resulting T to both C and T, you could end up with an increased number of reads that score the same in multiple locations with similar sequences. That wouldn't necessarily affect your alignment score, but it would hurt your mapping quality score, since having multiple locations with the same alignment score severely penalizes mapping quality.
I will say that this thinking is more of a gut feeling, and I can think of a couple of counterarguments off the top of my head as to why it might be wrong: a single C→T SNP in a read won't necessarily mean you'll get more secondary mappings, and unconverted data means there are more C's in reads that can only map to C's in the reference (i.e., scored as a match), which will restrict the locations your reads can map to. It'd be an interesting experiment to map an unconverted dataset with both BISCUIT and bwa-mem and compare the alignments to see the impact on alignment locations, alignment scores, and mapping quality scores. That would probably give you the best idea of whether aligning unconverted datasets with BISCUIT is a viable option.
(And I'd be interested to hear the results of those comparisons if you do end up trying that.)
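The point about ties crushing mapping quality can be sketched numerically. This is loosely in the spirit of bwa-style MAPQ, which depends on the gap between the best and second-best alignment scores; the formula and scaling constant here are made up for illustration, not BISCUIT's actual computation:

```python
# Hedged sketch: mapping quality as a function of the score gap between the
# best and second-best candidate locations (illustrative formula only).
def approx_mapq(best_score, second_best_score, scale=6):
    if second_best_score >= best_score:
        return 0  # a tie: the aligner cannot pick a unique location
    return min(60, scale * (best_score - second_best_score))

assert approx_mapq(50, 50) == 0   # equal scores at two locations: MAPQ 0
assert approx_mapq(50, 40) == 60  # clearly unique best hit: capped at 60
```

This is why equal-best secondary alignments hurt mapping quality even when the alignment score itself is unchanged.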
Hi Jacob,
Thanks a lot for your detailed explanation! I really appreciate it. Yes, we
will probably run non-BST converted samples with Biscuit and BWA mem to
compare the results and keep you posted.
By the way, to turn off soft-clipping in Biscuit (or BWA-MEM) alignment,
what number should I set for "-L"? In principle it should be positive
infinity, but I think a real number has to be entered.
Best regards,
Ian
Thanks for being willing to keep me posted! My understanding of -L is that it sets the penalty for soft-clipping, so there is no way to enter infinity, but a sufficiently large finite value should effectively turn clipping off.
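The effect of a clipping penalty can be sketched as a decision rule, following bwa-mem's documented -L semantics (assumed here to carry over to BISCUIT, since its aligner is bwa-mem based): clipping is applied only when the best local score beats the end-to-end extension score by more than L, so a very large L forces end-to-end alignment.

```python
# Hedged sketch of a bwa-mem-style clipping decision (assumed semantics):
# clip only if the local alignment's advantage exceeds the clip penalty L.
def clips(best_local_score, end_to_end_score, L):
    return best_local_score - L > end_to_end_score

assert clips(100, 80, 5) is True           # modest L: clipping wins
assert clips(100, 80, 1_000_000) is False  # huge L: forced end-to-end
```

So rather than "infinity", any L larger than the maximum possible score advantage (roughly read length times the match score) effectively disables soft-clipping.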
Command and results:
```
$ biscuit align -MC -k 12 -t 6 -p hg19.fa BST_50_1_CRC_S4_R1_R2.fastq > BST_50_1_CRC_S4.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 428404 sequences (60000236 bp)...
[bseq_classify] 0 SE sequences; 428404 PE sequences
[M::process] read 428114 sequences (60000124 bp)...
[M::mem_pestat] # candidate unique pairs: 98565
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.11, 22.07)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428404 reads in 965.734 CPU sec, 161.535 real sec
[bseq_classify] 0 SE sequences; 428114 PE sequences
[M::process] read 428398 sequences (60000072 bp)...
[M::mem_pestat] # candidate unique pairs: 98816
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.06, 21.88)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428114 reads in 995.806 CPU sec, 166.452 real sec
[bseq_classify] 0 SE sequences; 428398 PE sequences
[M::process] read 428288 sequences (60000164 bp)...
[M::mem_pestat] # candidate unique pairs: 98667
[M::mem_pestat] (25, 50, 75) percentile: (291, 307, 318)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (237, 372)
[M::mem_pestat] mean and std.dev: (304.07, 21.94)
[M::mem_pestat] low and high boundaries for proper pairs: (210, 399)
[M::mem_process_seqs] Processed 428398 reads in 1043.869 CPU sec, 174.309 real sec
Segmentation fault
```
BISCUIT version:
```
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 0.3.8.20180515
```
Is this segfault issue fixed in newer versions of Biscuit?