Stitched contig and 'N' characters #80

Wwwangwh · 2022-06-12T06:02:47Z

Hello,
Thank you for updating the project to 2.0; it helps a lot, but I'm a little confused about some of the optional arguments.

1. --no_padding_supercontigs
After the update, we found some 'N' characters in the output sequences.
The description of the optional arguments says:
If the flag --run_intronerate is provided, HybPiper will attempt to recover intron sequences (if present for a given gene/sample) and a supercontig sequence. The supercontig sequence comprises both exons and introns; in cases where it has been constructed from more than one SPAdes contig, HybPiper will add a stretch of 10 'N' characters between abutting contigs. This can be turned off using the flag --no_padding_supercontigs.
I wonder whether the insertion of 'N' characters depends on --run_intronerate or not, and whether 'N' characters are added between contigs from different sources in all cases. We ask because we also found 'N' characters when we assembled genes without --run_intronerate.

2. Do the extracted gene sequences contain introns by default (hybpiper assemble -t_dna prb.fasta -r *.fasta --prefix --bwa)?
We found some genes with more than 100% coverage (GenesAt150pct). Is this because introns are extracted, or is it due to species specificity?

Thank you in advance!

chrisjackson-pellicle · 2022-06-14T00:50:57Z

Hi @Wwwangwh,

The Ns you're seeing in some of your assembled nucleotide gene sequences (and the corresponding Xs in the translated protein gene sequences) are different to the fixed stretches of 10 Ns inserted between SPAdes contigs in your supercontig sequences (the latter supercontig sequences are produced by providing the --run_intronerate flag). The following info is currently buried in the change_log.md file, but I'll add it to the main Wiki as well:

In cases where HybPiper recovers sequence for multiple non-contiguous segments of a gene, the gaps between the segments will be padded with a number of 'N' characters. The number of Ns corresponds to the number of amino acids in the 'best' protein reference for that gene that do not have corresponding SPAdes contig hits, multiplied by 3 to convert to nucleotides.

In HybPiper 1.x, if non-contiguous segments were recovered for a given gene (e.g. there might be partial sequence for the 5' end and the 3' end, but no sequence for the middle section of a gene), these sequences were simply concatenated. This meant that it was up to the downstream alignment software to split the concatenated sequence at the correct point, which could easily introduce errors into the alignment. In HybPiper 2 I inserted a number of Ns proportional to any gaps to a) signal to the user that the exon sequences recovered are not contiguous (at least relative to the chosen reference from the target file - I guess it's always possible that a given taxon is actually missing a section of a gene relative to the reference); and b) increase the accuracy of downstream alignments in these scenarios.
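
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is an illustration only, not HybPiper's actual code: the function name join_segments and the tuple layout (start and end positions on the reference protein in amino acids, 0-based and half-open, plus the recovered nucleotide sequence) are assumptions made up for this example.

```python
def join_segments(segments, pad=True):
    """Join recovered exon segments for one gene (hypothetical helper).

    segments: list of (ref_start_aa, ref_end_aa, nucleotide_seq) tuples,
              with coordinates on the reference protein, 0-based, half-open.
    pad=True  mimics the 2.x idea: fill each gap with 3 Ns per missing amino acid.
    pad=False mimics the 1.x idea: concatenate the segments directly.
    """
    if not segments:
        return ""
    ordered = sorted(segments)
    joined = ordered[0][2]
    prev_end = ordered[0][1]
    for start, end, seq in ordered[1:]:
        # Reference amino acids with no recovered sequence between segments.
        missing_aa = max(0, start - prev_end)
        if pad:
            joined += "N" * (missing_aa * 3)
        joined += seq
        prev_end = max(prev_end, end)
    return joined


# Two segments covering reference residues 0-4 and 10-12: a 6 amino acid gap.
example = [(0, 4, "ATGGCTGCTGCT"), (10, 12, "GGTTAA")]
padded = join_segments(example, pad=True)
concatenated = join_segments(example, pad=False)
print(padded)
print(concatenated)
```

In this toy case the padded sequence carries 18 Ns (6 missing amino acids multiplied by 3) between the two segments, while the concatenated version gives an aligner no hint that anything is missing.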

Would it be useful to add an optional flag that reproduces HybPiper 1.x behaviour, i.e., simply concatenating non-contiguous exon sequences?

N.B. These Ns are stripped from the gene sequences when recovering length statistics with hybpiper stats, so they don't artificially 'improve' your recovery length statistics.

The extracted gene sequences (i.e. the *.FNA and *.FAA sequences) should not contain introns. The only scenario where I can imagine this happening is if some of your target file sequences contain introns (they shouldn't). Sequences with introns are only recovered by providing the flag --run_intronerate to the command hybpiper assemble, and then recovering the supercontig (or intron for introns-only) sequences using hybpiper retrieve_sequences.

The gene % length recovery stats (GenesAt75pct, GenesAt150pct, etc) are calculated by comparing the length of a given assembled gene to the mean length of the representative gene sequences in your target file. So, if:

a) For geneX, your target file contains 3 representative sequences that are 400 bp, and one that is 750 bp;
b) This latter 750 bp sequence is chosen by HybPiper as the 'best' reference sequence for geneX for sampleA;
c) HybPiper assembles a ~750 bp sequence for geneX, sampleA;

...then the sequence for geneX, sampleA will be reported in the GenesAt150pct category (as the mean reference length is 487.5 bp, and 750 > 1.5 * 487.5 = 731.25).
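
The arithmetic above can be checked with a short Python sketch. This is only an illustration of the calculation as described in this thread, not the code that hybpiper stats runs; the function name percent_of_mean_reference is made up for the example, and the strict "greater than 150" comparison is an assumption based on the inequality above.

```python
from statistics import mean


def percent_of_mean_reference(assembled_seq, reference_lengths):
    """Assembled length (padding Ns removed) as a % of the mean reference length."""
    recovered_len = len(assembled_seq.upper().replace("N", ""))
    return 100 * recovered_len / mean(reference_lengths)


# geneX: three 400 bp references and one 750 bp reference in the target file.
ref_lengths = [400, 400, 400, 750]

# A ~750 bp assembled sequence for geneX, sampleA (placeholder bases).
assembled = "A" * 750

pct = percent_of_mean_reference(assembled, ref_lengths)
print(f"mean reference length: {mean(ref_lengths)} bp")
print(f"recovered: {pct:.1f}% of mean reference length")
print("counts towards GenesAt150pct:", pct > 150)
```

With these numbers the gene comes out at roughly 153.8% of the mean reference length, even though it is no longer than the reference it was assembled against.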

Do the genes you're seeing in the GenesAt150pct category fall into this scenario? Happy to troubleshoot further if not.

Cheers,

Chris

Wwwangwh · 2022-06-15T08:09:39Z

Hi Chris,

Thank you for your reply; the explanation of the Ns and coverage rates is very detailed.

The reasoning about downstream analysis is well thought out, so I suppose it's unnecessary to add an optional flag to reproduce HybPiper 1.x behaviour. Those who want that result can easily achieve it on the command line.
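
For example, a minimal Python sketch along these lines would do it. The function name strip_padding is hypothetical, and note the assumption that every N in the sequence lines is padding: any genuinely ambiguous N would be removed as well.

```python
def strip_padding(fasta_text, pad_char="N"):
    """Remove pad_char from sequence lines of FASTA text, leaving headers untouched."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # header line: keep as-is
        else:
            cleaned.append(line.replace(pad_char, ""))
    return "\n".join(cleaned) + "\n"


# Toy record; the header deliberately contains an 'N' to show it is preserved.
record = ">geneN_sampleA\nATGGCTNNNNNNGGTTAA\n"
print(strip_padding(record), end="")
```

For the translated *.FAA sequences the same function could be called with pad_char="X".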

Wangwh

mustafaraza1987 · 2022-07-08T09:10:02Z

Hi Chris,
I'd like to know why run times differ between the DIAMOND and BLASTX assemblies.
For one sample, I noticed that DIAMOND took 57 minutes and SPAdes took 31 minutes.
For the same sample, BLASTX took 1 hour 43 minutes and SPAdes took 21 minutes.
I also checked, and more reads were distributed to the gene directories with BLASTX.
If fewer reads are distributed with DIAMOND, why does the assembly take longer?

chrisjackson-pellicle closed this as completed Jun 16, 2022
