Stitched contig and 'N' characters #80

Closed
Wwwangwh opened this issue Jun 12, 2022 · 3 comments

Comments

@Wwwangwh

Hello,
Thank you for updating the project to 2.0, it helps a lot, but I'm a little confused about some optional arguments.

  1. --no_padding_supercontigs
    After the update, we found some 'N' characters in the output sequences.
    The description of the optional arguments says:
    If the flag --run_intronerate is provided, HybPiper will attempt to recover intron sequences (if present for a given gene/sample) and a supercontig sequence. The supercontig sequence comprises both exons and introns; in cases where it has been constructed from more than one SPAdes contig, HybPiper will add a stretch of 10 'N' characters between abutting contigs. This can be turned off using the flag --no_padding_supercontigs.
    I wonder whether the insertion of 'N' characters is tied to --run_intronerate or not, and whether 'N' is always added between contigs from different sources. I ask because we also found 'N' characters when assembling genes without --run_intronerate.

  2. Will the extracted gene sequences contain introns by default ( hybpiper assemble -t_dna prb.fasta -r *.fasta --prefix --bwa)?
    We found some genes with more than 100% coverage (GenesAt150pct). Is this because introns are extracted, or because of species-specific differences?

Thank you in advance!

@chrisjackson-pellicle
Collaborator

chrisjackson-pellicle commented Jun 14, 2022

Hi @Wwwangwh,

  1. The Ns you're seeing in some of your assembled nucleotide gene sequences (and the corresponding Xs in the translated protein gene sequences) are different to the fixed stretches of 10 Ns inserted between SPAdes contigs in your supercontig sequences (the latter supercontig sequences are produced by providing the --run_intronerate flag). The following info is currently buried in the change_log.md file, but I'll add it to the main Wiki as well:

    In cases where HybPiper recovers sequence for multiple non-contiguous segments of a gene, the gaps between the segments will be padded with a number of 'N' characters. The number of Ns corresponds to the number of amino acids in the 'best' protein reference for that gene that do not have corresponding SPAdes contig hits, multiplied by 3 to convert to nucleotides.

    In HybPiper 1.x, if non-contiguous segments were recovered for a given gene (e.g. there might be partial sequence for the 5' end and the 3' end, but no sequence for the middle section of a gene), these sequences were simply concatenated. This meant that it was up to the downstream alignment software to split the concatenated sequence at the correct point, which could easily introduce errors into the alignment. In HybPiper 2 I inserted a number of Ns proportional to any gaps to a) signal to the user that the exon sequences recovered are not contiguous (at least relative to the chosen reference from the target file; I guess it's always possible that a given taxon is actually missing a section of a gene relative to the reference); and b) increase the accuracy of downstream alignments in these scenarios.

    Would it be useful to add an optional flag that reproduces HybPiper 1.x behaviour, i.e., simply concatenating non-contiguous exon sequences?

    N.B. These Ns are stripped from the gene sequences when recovering length statistics with hybpiper stats, so they don't artificially 'improve' your recovery length statistics.

  2. The extracted gene sequences (i.e. the *.FNA and *.FAA sequences) should not contain introns. The only scenario where I can imagine this happening is if some of your target file sequences contain introns (they shouldn't). Sequences with introns are only recovered by providing the flag --run_intronerate to the hybpiper assemble command, and then recovering the supercontig (or intron, for introns only) sequences using hybpiper retrieve_sequences.

    The gene % length recovery stats (GenesAt75pct, GenesAt150pct, etc) are calculated by comparing the length of a given assembled gene to the mean length of the representative gene sequences in your target file. So, if:

    a) For geneX, your target file contains 3 representative sequences that are 400 bp, and one that is 750 bp;
    b) This latter 750 bp sequence is chosen by HybPiper as the 'best' reference sequence for geneX for sampleA;
    c) HybPiper assembles a ~750 bp sequence for geneX, sampleA;

    ...then the sequence for geneX, sampleA will be reported in the GenesAt150pct category (as the mean reference length is 487.5 bp, and 750 > 1.5 * 487.5 = 731.25).

    Do the genes you're seeing in the GenesAt150pct category fall into this scenario? Happy to troubleshoot further if not.
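The arithmetic in the geneX example can be checked with a short Python sketch. The function name `percent_of_mean_reference` is hypothetical and this is not HybPiper's own code; it only reproduces the calculation as described above, assuming the comparison is made against the plain arithmetic mean of the target file lengths.

```python
from statistics import mean

def percent_of_mean_reference(assembled_len, reference_lens):
    """Assembled gene length as a percentage of the mean reference length.

    Hypothetical helper for illustration only, not part of HybPiper.
    """
    return 100.0 * assembled_len / mean(reference_lens)

# geneX: three 400 bp representative sequences and one 750 bp sequence
gene_x_refs = [400, 400, 400, 750]
mean_ref_len = mean(gene_x_refs)

# sampleA: a 750 bp assembled sequence for geneX
pct = percent_of_mean_reference(750, gene_x_refs)
in_150pct_category = pct > 150

print(mean_ref_len, round(pct, 1), in_150pct_category)
```

A gene can therefore exceed 150% of the mean reference length without containing any intron sequence, simply because the representative sequences in the target file differ in length.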

Cheers,

Chris
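The padding rule from point 1 (three Ns per reference amino acid without a contig hit) can be illustrated with a minimal Python sketch. The function `pad_gap` and the example sequences are hypothetical and only mirror the rule as stated in the comment; HybPiper's actual implementation may differ.

```python
def pad_gap(left_segment, right_segment, missing_aa):
    """Join two non-contiguous segments with 3 Ns per unmatched amino acid.

    Hypothetical helper: `missing_aa` is the number of amino acids in the
    chosen protein reference that have no corresponding contig hit.
    """
    return left_segment + "N" * (3 * missing_aa) + right_segment

five_prime = "ATGGCTGAA"    # made-up 5' segment
three_prime = "GGTCTTTAA"   # made-up 3' segment

# Four reference amino acids without hits give 4 * 3 = 12 Ns of padding
padded = pad_gap(five_prime, three_prime, missing_aa=4)
print(padded)
```

Because each N run is a multiple of three, the padded sequence stays in frame, which is consistent with the Xs seen in the translated protein sequences.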

@Wwwangwh
Author

Hi Chris,

Thank you for your reply; the explanation of the Ns and the coverage rates is detailed.

The handling of downstream analysis is well thought out, so I suppose it's unnecessary to add an optional flag to reproduce HybPiper 1.x behaviour. Those who want that result can easily achieve it from the command line.

Wangwh
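As a rough sketch of that workaround, the padding can be removed from FASTA text in a few lines of Python. The function `strip_n_from_fasta` is hypothetical, and it assumes every N in the gene sequences is padding; any genuinely ambiguous bases would be removed as well, so treat it as an illustration and not a recommended workflow.

```python
def strip_n_from_fasta(fasta_text):
    """Remove all N/n characters from sequence lines, keeping headers intact.

    Hypothetical helper: assumes every N is padding, which may not hold
    if the sequences also contain genuinely ambiguous bases.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header line, left unchanged
        else:
            stripped = line.replace("N", "").replace("n", "")
            if stripped:  # drop lines that consisted only of Ns
                out.append(stripped)
    return "\n".join(out)

# Made-up record with a 6 bp padded gap
example = ">geneX\nATGGCTNNNNNNGGTTAA\n"
result = strip_n_from_fasta(example)
print(result)
```

The result is the simple concatenation of the recovered segments, which is the HybPiper 1.x behaviour described earlier in the thread.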

@mustafaraza1987

Hi Chris,
I would like to know why the run times differ between DIAMOND and BLASTx.
For one sample, the DIAMOND step took 57 minutes and SPAdes took 31 minutes.
For the same sample, the BLASTx step took 1 hour 43 minutes and SPAdes took 21 minutes.
I also checked the number of reads distributed to the gene directories, and it was higher with BLASTx.
If fewer reads are distributed with DIAMOND, why does the assembly take longer?
