Is the final assembly a function of the number of sequences? #666

Closed
ImagoXV opened this issue Jan 30, 2024 · 6 comments
@ImagoXV

ImagoXV commented Jan 30, 2024

Hi there, I have a question.

Context

As part of a practical course I teach for a master's degree, I want the students to work through some basic metagenomics steps, so I am preparing a small dataset that runs quickly.

Problem

I used flye --nano-hq seq.fastq --out-dir Assembly --threads 7 --min-overlap 1000 --iterations 4 to assemble my Nanopore metagenome.
It is a tropical soil metagenome composed largely of unknown genomes, so no reference is available at all.

I explored the assembly graph with Bandage and picked out MAGs that looked good, notably one nice circular one.

I then realigned all the reads from my fastq metagenome to the extracted MAG.

I took the aligned reads and fed them to Flye, aiming to reconstruct the nice-looking circular MAG I obtained at first.

You would expect that assembling the MAG from only the reads that map to it would give the same reconstruction, or perhaps even something cleaner.

However, that is not at all what I get. Instead of a circular contig, I now get a collection of contigs.

I am highly puzzled by this situation.

It could mean several things:

  • I don't understand how Flye (or any other assembler) works
  • My initial reconstructions could be (are?) assembly artifacts
  • The assembly process is less reproducible than I thought

I've discussed this with some bioinformaticians. I was told that the proportion of the whole dataset made up by the overlapping reads can influence which paths are chosen in the assembly graph.

Please enlighten me.

Arthur

@mikolmogorov
Owner

Hi Arthur,

When you realign all reads to a single contig, you may recruit reads from other species that share some homology with it (e.g. repeats or HGT events). This can result in contig breaks in non-meta mode (and possibly in meta mode too).
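
One way to look for such recruited reads is to check how much of each read actually aligns to the contig. The sketch below is only an illustration and is not part of Flye. It assumes the alignments are available in minimap2's PAF format; the function name `well_covered_reads`, the 0.8 threshold, and the example records are hypothetical.

```python
from typing import Iterable, Set


def well_covered_reads(paf_lines: Iterable[str],
                       min_read_fraction: float = 0.8) -> Set[str]:
    """Return names of reads with an alignment spanning most of the read.

    PAF columns used (1-based): 1 = read name, 2 = read length,
    3 = read start, 4 = read end. The threshold is an assumption.
    """
    keep: Set[str] = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip malformed or truncated records
            continue
        name = fields[0]
        read_len, start, end = int(fields[1]), int(fields[2]), int(fields[3])
        if read_len > 0 and (end - start) / read_len >= min_read_fraction:
            keep.add(name)
    return keep


# Hypothetical records: readA aligns almost end to end, while readB only
# shares a 1.5 kb stretch with the contig (e.g. a repeat).
example_paf = [
    "\t".join(["readA", "10000", "50", "9900", "+", "contig_1",
               "500000", "1000", "10850", "9500", "9850", "60"]),
    "\t".join(["readB", "12000", "0", "1500", "-", "contig_1",
               "500000", "20000", "21500", "1300", "1500", "12"]),
]
selected = well_covered_reads(example_paf)
```

Reads such as `readB` would still show up as "mapped" in a plain alignment filter, which is how partially homologous reads can end up in the subset.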

@ImagoXV
Author

ImagoXV commented Feb 6, 2024

Dear Fenderglass, thanks for your answer.

Maybe there is a way to find all the reads recruited to form a contig from the intermediate minimap files?
If I manage to extract those reads, do you think the reconstructed object will be exactly the same, regardless of the proportion of matching reads in the global dataset?

@mikolmogorov
Owner

I think aligning reads against the entire assembly, rather than just one contig, should help. But there is no guarantee that you'll get an identical assembly, because the algorithm is heuristic.
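
As a rough sketch of that idea, one could align the reads to all contigs and keep only the reads whose best hit is the contig of interest. The code below assumes PAF input from minimap2 and uses the number of matching residues (column 10) as the score; the function name, the scoring choice, and the example records are assumptions made for illustration, not something Flye does internally.

```python
from typing import Dict, Iterable, Set, Tuple


def reads_best_assigned_to(paf_lines: Iterable[str],
                           target_contig: str) -> Set[str]:
    """Return reads whose highest-scoring alignment is on target_contig.

    PAF columns used (1-based): 1 = read name, 6 = contig name,
    10 = number of matching residues. Ties keep the first hit seen.
    """
    best: Dict[str, Tuple[int, str]] = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip malformed or truncated records
            continue
        read, contig, n_match = fields[0], fields[5], int(fields[9])
        if read not in best or n_match > best[read][0]:
            best[read] = (n_match, contig)
    return {read for read, (_, contig) in best.items()
            if contig == target_contig}


# Hypothetical records: readB touches contig_1 but fits contig_7 far better.
whole_assembly_paf = [
    "\t".join(["readA", "10000", "50", "9900", "+", "contig_1",
               "500000", "1000", "10850", "9500", "9850", "60"]),
    "\t".join(["readB", "12000", "0", "1500", "-", "contig_1",
               "500000", "20000", "21500", "1300", "1500", "12"]),
    "\t".join(["readB", "12000", "100", "11900", "+", "contig_7",
               "80000", "3000", "14800", "11000", "11800", "60"]),
    "\t".join(["readC", "9000", "0", "8800", "+", "contig_7",
               "80000", "40000", "48800", "8000", "8800", "60"]),
]
mag_reads = reads_best_assigned_to(whole_assembly_paf, "contig_1")
```

With only `contig_1` as the reference, `readB` would have been recruited; with the whole assembly as the reference it is assigned elsewhere.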

@ImagoXV
Author

ImagoXV commented Feb 8, 2024

Thank you very much for your answer.

Correct me if I'm wrong, but there's no way to set a seed for Flye to improve reproducibility, right?

Could we perhaps set the system RNG seed to improve the reproducibility of the heuristic step?

How reliable would you consider blind reconstructions of unknown environmental MAGs made this way? Would you trust them?

@mikolmogorov
Owner

Determinism of Flye on identical input is discussed here: #640
If the input reads are different, there is no way to guarantee that the assembly will be identical.
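
To check empirically whether two runs produced the same contigs, one option is to compare a digest of the sequences that ignores contig names, contig order, and strand. This is a minimal sketch under those assumptions and not a Flye feature; all names and the toy sequences are hypothetical, and it does not handle circular contigs that start at a different position.

```python
import hashlib
from typing import Iterable, List

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta_sequences(lines: Iterable[str]) -> List[str]:
    """Collect uppercase sequences from FASTA lines, dropping the headers."""
    sequences: List[str] = []
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                sequences.append("".join(chunks))
                chunks = []
        elif line:
            chunks.append(line.upper())
    if chunks:
        sequences.append("".join(chunks))
    return sequences


def assembly_digest(lines: Iterable[str]) -> str:
    """SHA-256 over contigs, each reduced to the smaller of its two strands
    and then sorted, so names, order, and orientation do not matter."""
    canonical = sorted(min(s, reverse_complement(s))
                       for s in read_fasta_sequences(lines))
    digest = hashlib.sha256()
    for seq in canonical:
        digest.update(seq.encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()


# Toy example: run_b holds the same contigs as run_a, renamed, reordered,
# and reverse-complemented.
run_a = [">contig_1", "ACGTTG", ">contig_2", "GGGA"]
run_b = [">contig_9", "TCCC", ">contig_3", "CAACGT"]
same_assembly = assembly_digest(run_a) == assembly_digest(run_b)
```

In practice the lines would come from the FASTA files of two runs, for example by passing open file handles to `assembly_digest`.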

@ImagoXV
Author

ImagoXV commented Feb 13, 2024

OK, I understand, thank you. I saw that an RNG seed setting may be added in the future. Great idea. Thanks!

Arthur

@ImagoXV ImagoXV closed this as completed Feb 13, 2024