FAQ
Consistent with convention, RagTag expects that genome assemblies only represent one homologous genome at a time (though they may represent many homoeologous genomes). Please read Heng Li’s blog post to learn about how genome assemblies are used to represent samples with more than one haplotype.
I have one addendum to these posts, which is that despite the use of the term “dual”, the principle of dual assemblies should, in theory, be applicable to samples with more than two haplotypes (e.g. autopolyploids). Also, we can consider assemblies representing monoploid genomes or genomes with negligible heterozygosity (inbreds, doubled haploids etc.) as “haplotype-resolved”.
Using terminology from these blog posts, we can say that, in keeping with convention, RagTag expects input genome assemblies to be one of the following:
- A “squashed” / "collapsed" / "consensus" assembly
- A “primary” assembly
- One assembly from a set of “dual” assemblies
- One assembly from a set of “haplotype-resolved” assemblies
So yes, RagTag can work with multiple haplotypes and polyploids. One may just need to run RagTag multiple times on separate assemblies, where applicable.
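As a rough illustration of running RagTag once per assembly, the sketch below builds one `ragtag.py scaffold` command for each haplotype assembly, each with its own output directory. The file names (`reference.fa`, `hap1.fa`, `hap2.fa`), the `ragtag_runs` directory, and the helper function itself are hypothetical placeholders, and the commands are only printed, not executed.

```python
from pathlib import Path


def build_scaffold_commands(reference, assemblies, out_root="ragtag_runs"):
    """Return one `ragtag.py scaffold` command (as an argument list) per assembly.

    Each assembly gets a separate output directory so that runs do not
    overwrite each other. All paths here are illustrative placeholders.
    """
    commands = []
    for asm in assemblies:
        # Name the output directory after the assembly file, e.g. ragtag_runs/hap1
        out_dir = (Path(out_root) / Path(asm).stem).as_posix()
        commands.append(
            ["ragtag.py", "scaffold", "-o", out_dir, str(reference), str(asm)]
        )
    return commands


# Hypothetical inputs: two assemblies from a "dual" or haplotype-resolved set.
commands = build_scaffold_commands("reference.fa", ["hap1.fa", "hap2.fa"])
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` on a system where RagTag is installed.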
What about false duplications?
A common type of misassembly is false duplication of alleles. Though genome assemblies should represent just one homologous allele at each locus, sometimes multiple alleles from the same locus are incorrectly placed next to each other in a genome assembly [1]. Of all the RagTag tools and utilities, only `correct` attempts to correct misassemblies, and even so, `correct` cannot correct false duplications. Therefore, such misassemblies should be corrected prior to using RagTag. Some genome assemblers, such as Hifiasm, will correct these errors automatically. Or, one can use tools such as purge_dups to remove such errors.
1. Rhie, A., McCarthy, S.A., Fedrigo, O. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
Suppose you have two de novo assemblies and you would like to use one to patch the other. Which should be the "target" (the assembly being patched) and which should be the "query" (the assembly that will provide the patches)? RagTag will technically work in either direction. But in order to produce the most accurate assembly, in my opinion, there are two main things to consider.
The majority of the final patched assembly will come directly from the target assembly. Therefore, the haplotype(s) represented in the target assembly will be dominant. If both de novo assemblies are haplotype-resolved and they represent the same haplotype with similar phasing accuracy, then either configuration is equally valid with respect to haplotype representation. However, if the assemblies represent different haplotypes or they have different phasing accuracies, then one should consider this when deciding how to use these assemblies for patching. I recommend using the de novo assembly with the best phasing accuracy and/or the assembly with the haplotype of particular interest (if applicable) as the target assembly.
Again, the majority of the final patched assembly will come directly from the target assembly. Therefore, if one de novo assembly is much more accurate than the other, then I recommend using the more accurate assembly as the target. In the RagTag preprint and in the human T2T publication (which did not use RagTag but instead used manual patching), we used the HiFi-based assembly as the target and the Nanopore-based assembly as the query because the HiFi-based assemblies were more accurate.
While these are the two most relevant considerations for modern genome assemblies, other factors can be relevant (e.g. contiguity). Ultimately, it is up to the discretion of the user to decide which approach is best.
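The accuracy consideration can be sketched in code. The example below assumes that each assembly comes with a user-supplied accuracy estimate (for example, a consensus quality value) and places the more accurate assembly in the target position of a `ragtag.py patch` command. The helper function, the file names, and the numbers are hypothetical placeholders, not measurements, and the rule deliberately ignores haplotype representation and phasing accuracy, which must still be weighed separately.

```python
def build_patch_command(asm_a, asm_b, out_dir="ragtag_patch"):
    """Build a `ragtag.py patch` command with the more accurate assembly as target.

    asm_a and asm_b are (fasta_path, accuracy_estimate) tuples, where a higher
    estimate means a more accurate assembly. On a tie, asm_a stays the target.
    """
    # Sort by accuracy, highest first; Python's sort is stable, so ties keep input order.
    target, query = sorted([asm_a, asm_b], key=lambda asm: asm[1], reverse=True)
    # ragtag.py patch takes the target assembly first and the query assembly second.
    return ["ragtag.py", "patch", "-o", out_dir, target[0], query[0]]


# Placeholder values for illustration only.
patch_cmd = build_patch_command(("ont_asm.fa", 35.0), ("hifi_asm.fa", 50.0))
print(" ".join(patch_cmd))
```

With these placeholder values, the HiFi-based assembly becomes the target and the Nanopore-based assembly becomes the query, matching the configuration described above.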
Are these docs confusing or incomplete? Please open an issue and let me know.