agat_sp_extract_sequences.pl does not incorporate CDS feature ID in headers #450
Comments
Sounds fair.
The multicistronic case has never been taken into account. Please open a separate issue so it can be discussed (at the very least, other users will then be aware of this AGAT limitation).
CDS chunks may share the same identifier. In that case, how should the different extracted CDS chunks be told apart?
I'm not entirely sure what exactly "CDS chunks" refers to. Do you mean the CDS segments located on different exons?
Yes.
OK. I'm not familiar enough with annotation formats and conventions to know what the best approach is when it matters which chunks a CDS is constructed from. I was simply thinking that it would make sense to use the CDS identifier in the header of the FASTA file whenever the GFF file provides one.
Describe the bug
According to the documentation, the headers created by the script are formatted:
However, when the script is used to extract the sequences of CDS features, the header IDs contain the ID of the mRNA feature rather than that of the selected CDS feature.
e.g.
>transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
instead of
>CDS:ENSP00000431562 gene=gene:ENSG00000182378 seq_id=X type=cds
General (please complete the following information):
v1.4
Singularity
Ubuntu Linux
To Reproduce
Simply run the script on any gff3 file containing ID fields in the CDS attribute fields.
E.g., using https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/.
agat_sp_extract_sequences.pl -g Homo_sapiens.GRCh38.110.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa -o cdss.fa -t cds
Expected behavior
Use the CDS ID in the header rather than the transcript/mRNA ID.
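To make the request concrete, here is a minimal Python sketch of the ID selection I have in mind. It is not AGAT's code, and the function names (`parse_attributes`, `cds_header_ids`) are hypothetical. It assumes plain tab-separated GFF3 lines, and the coordinates in the example are made up; only the two IDs come from the headers quoted above. The CDS feature's own `ID` is preferred, with a fallback to the parent transcript ID when the CDS lines carry none.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column ('ID=a;Parent=b') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def cds_header_ids(gff3_lines):
    """Map each parent transcript ID to the ID a CDS FASTA header could use.

    The first CDS line seen for a transcript decides the header ID: its own
    ID attribute if present, otherwise the parent transcript ID.
    """
    header_ids = {}
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2].lower() != "cds":
            continue  # only CDS features are of interest here
        attrs = parse_attributes(cols[8])
        if "Parent" not in attrs:
            continue
        # GFF3 allows several comma-separated parents per feature
        for parent in attrs["Parent"].split(","):
            header_ids.setdefault(parent, attrs.get("ID", parent))
    return header_ids


# Hypothetical coordinates; the IDs are the ones from the headers above.
example = [
    "X\tensembl\tCDS\t100\t200\t.\t+\t0\t"
    "ID=CDS:ENSP00000431562;Parent=transcript:ENST00000399012\n",
    "X\tensembl\tCDS\t300\t400\t.\t+\t1\t"
    "ID=CDS:ENSP00000431562;Parent=transcript:ENST00000399012\n",
]

for parent, header_id in cds_header_ids(example).items():
    print(f"{parent} -> {header_id}")
```

Both CDS lines share one ID here, so the transcript still yields a single header ID; this is the same "chunks share an identifier" situation discussed in the comments.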
Additional context
Somewhat off-topic, but I was trying to apply this tool to GFF3 files containing multiple CDS IDs per mRNA (multicistronic). It seems this is currently not supported.
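For anyone who wants to check whether their annotation is affected, the following Python sketch lists transcripts whose CDS lines carry more than one distinct ID. It is only an illustration under the assumption that a multicistronic mRNA shows up as several CDS IDs sharing one `Parent`; the function name and the example IDs (`mrna1`, `cdsA`, and so on) are hypothetical and not part of AGAT.

```python
from collections import defaultdict


def transcripts_with_multiple_cds_ids(gff3_lines):
    """Return {transcript ID: sorted CDS IDs} for transcripts with >1 CDS ID."""
    cds_ids = defaultdict(set)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2].lower() != "cds":
            continue
        # Parse the attribute column into key/value pairs
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "ID" not in attrs or "Parent" not in attrs:
            continue
        for parent in attrs["Parent"].split(","):
            cds_ids[parent].add(attrs["ID"])
    return {p: sorted(ids) for p, ids in cds_ids.items() if len(ids) > 1}


# Hypothetical records: mrna1 carries two CDS IDs, mrna2 only one.
records = [
    "chr1\tsrc\tCDS\t10\t90\t.\t+\t0\tID=cdsA;Parent=mrna1\n",
    "chr1\tsrc\tCDS\t120\t300\t.\t+\t0\tID=cdsB;Parent=mrna1\n",
    "chr1\tsrc\tCDS\t500\t800\t.\t+\t0\tID=cdsC;Parent=mrna2\n",
]

print(transcripts_with_multiple_cds_ids(records))
```

Transcripts reported by such a check are the ones for which a single transcript-based header cannot distinguish the extracted CDS sequences.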