Duplication checks #64
Be careful with duplicates at the contig level, e.g.:
Mitochondrial and plastid sequences are 'contaminants' only in some contexts; in others they are part of the whole genome (the nuclear genome and the organellar genomes together make up the whole genome). We need to ensure that mitochondrial and chloroplast genomes are not themselves marked as contaminants when they are not. Plant nuclear genomes also contain chloroplast gene insertions, so we need to make sure these regions are not marked as contamination.
Hi @CeciliaDeng, does the above comment from Ross answer your question? Is there a tool you have in mind for duplicate detection?
Hi @GallVp and @rosscrowhurst, we have encountered duplicated sequences before in our NCBI submissions, in particular in de novo assemblies from short reads. Yes, the duplicated sequences are usually at the contig level, with exactly the same sequence but different SeqIDs. I ran 'ml seqkit; seqkit rmdup -s -o $checkedFasta $inputFasta' to remove such items.
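For reporting rather than removing such duplicates, a minimal standard-library sketch could look like the one below. It is not what seqkit does internally; `read_fasta` and `find_duplicate_sequences` are hypothetical helper names, and treating upper- and lower-case bases as identical is an assumption that may not suit soft-masked assemblies.

```python
import hashlib
from collections import defaultdict


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def find_duplicate_sequences(records):
    """Return lists of IDs whose sequences are identical.

    Assumption: comparison is case-insensitive; only a digest of each
    sequence is kept in memory, not the sequence itself.
    """
    groups = defaultdict(list)
    for seq_id, seq in records:
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        groups[digest].append(seq_id)
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    example = [">ctg1", "ACGT", "AC", ">ctg2 copy", "acgtac", ">ctg3", "TTTT"]
    print(find_duplicate_sequences(read_fasta(example)))
```

With a real file, `read_fasta(open(path))` would stream the records; the groups returned could then be reviewed before anything is dropped, which matters for the organellar cases mentioned earlier in the thread.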
For genomes we downloaded from the public domain, there are sometimes duplicated SeqIDs, and 'samtools faidx $inputFasta' will complain and exit. In that case we can use 'seqkit rmdup -n -o $outFile $inputFasta' to remove sequences with the same ID. However, sequences can differ even when they share a SeqID; in that case I usually append '.1', '.2' and so on to the IDs of the sequences that share an ID.
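The suffixing step described above can be sketched roughly as follows. This is an illustration under assumptions, not the script actually used: `make_ids_unique` is a hypothetical name, records are taken to be already-parsed (ID, sequence) pairs, every record sharing an ID is renamed (including the first), and a suffix is skipped if the resulting ID already exists in the input.

```python
from collections import Counter


def make_ids_unique(records):
    """Append '.1', '.2', ... to IDs that occur more than once.

    Assumptions: `records` is an iterable of (seq_id, sequence) pairs;
    IDs that occur only once are left untouched.
    """
    records = list(records)
    totals = Counter(seq_id for seq_id, _ in records)
    taken = set(totals)      # every ID already present in the input
    next_suffix = Counter()  # last suffix tried for each duplicated ID
    renamed = []
    for seq_id, seq in records:
        if totals[seq_id] == 1:
            renamed.append((seq_id, seq))
            continue
        # Advance the suffix until the candidate ID is not in use.
        while True:
            next_suffix[seq_id] += 1
            candidate = f"{seq_id}.{next_suffix[seq_id]}"
            if candidate not in taken:
                break
        taken.add(candidate)
        renamed.append((candidate, seq))
    return renamed


if __name__ == "__main__":
    example = [("ctg1", "ACGT"), ("ctg2", "GG"), ("ctg1", "TTTT")]
    for seq_id, seq in make_ids_unique(example):
        print(f">{seq_id}\n{seq}")
```

Records that share both ID and sequence are exact duplicates and could be dropped first, so that only genuinely different sequences receive a suffix.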
Thank you, @CeciliaDeng. This is very useful information. I will add the following to the FASTA validation:
We are using py_fasta_validator to validate
From @CeciliaDeng:
BTW, we haven't checked for duplicate sequences in the assembly, have we? I can't remember whether the QC pipeline checks for mitochondrial, plastid or ribosomal RNA contamination. If not, could we list these as 'todo for future release'?