Are there any hints to work with non-plant species? #91

Closed
DrHogart opened this issue Jun 18, 2020 · 5 comments
Labels
help wanted Extra attention is needed

Comments

@DrHogart

Hi,
I'm trying to explore the TE content of a mosquito genome. As far as I understand, the EDTA pipeline was developed to work with plant species and was inspired by this guide. I've found an undocumented option in EDTA that can help filter out protein-coding genes from the predicted TE families: $protlib (in EDTA_raw.pl), which points to a cleaned plant proteome. Obviously, I can change that path to a cleaned (without any traces of TE-derived proteins) mosquito-specific proteome. My question is: are there any other tweaks in EDTA that can help it work with non-plant genomes? I mean the discovery, filtering, and cleaning options.

@oushujun
Owner

Hi, yes. You don't need to tweak the code; just provide a mosquito CDS file to the program (--cds) to filter protein-coding sequences out of the TE annotation. Also, you may want to use --sensitive 1 to identify non-LTR retrotransposons (via RepeatModeler). Or, if you have a manually curated set of TEs, please give it to the program via --curatedlib. The set does not have to be complete or comprehensive, but please make sure the provided elements are authentic. There are many non-plant applications of this program; you can find them in #15.

Best,
Shujun
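
For readers who want to script this, here is a minimal sketch in Python that assembles an EDTA command line from the three options mentioned above (--cds, --sensitive, --curatedlib). The helper name `build_edta_command`, the file names, and the use of `EDTA.pl --genome` as the entry point are assumptions for illustration; check `EDTA.pl -h` for the options available in your installed version.

```python
import shlex
from typing import List, Optional


def build_edta_command(
    genome: str,
    cds: Optional[str] = None,
    curatedlib: Optional[str] = None,
    sensitive: bool = True,
) -> List[str]:
    """Assemble an argument list for EDTA (hypothetical helper, not part of EDTA).

    Only options named in this thread are used, plus --genome, which is
    assumed to be the way the input assembly is passed.
    """
    cmd = ["EDTA.pl", "--genome", genome]
    if cds:
        # CDS of the species (or a close relative) to filter out genic sequences.
        cmd += ["--cds", cds]
    if curatedlib:
        # Optional, possibly incomplete, manually curated TE library.
        cmd += ["--curatedlib", curatedlib]
    # --sensitive 1 adds a RepeatModeler pass (helps with non-LTR retrotransposons).
    cmd += ["--sensitive", "1" if sensitive else "0"]
    return cmd


if __name__ == "__main__":
    # File names below are placeholders.
    print(shlex.join(build_edta_command("mosquito.fa", cds="mosquito_cds.fa")))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`; building it as a list rather than a single string avoids shell-quoting problems with unusual file names.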

@oushujun added the "help wanted" label Jun 18, 2020
@DrHogart
Author

My genome is novel, just assembled, and gene annotation is not available yet. So I would prefer to use $protlib with proteins from a related species.
My question arose after reading the RM2 paper, in which they show that EDTA outperforms RM2 in terms of sensitivity only for plants, but not for Drosophila. So I'm wondering what settings could be tuned in EDTA to increase its sensitivity.
Also, there are a lot of if ($beta2==1) blocks inside the code that add some additional cleaning of the predicted sequences. Did you test this functionality with the reference species?

@oushujun
Owner

oushujun commented Jun 18, 2020 via email

@DrHogart
Author

Thanks.
Last question: why doesn't EDTA cluster the final TElib? CD-HIT and usearch show that there are some redundant sequences. E.g.

>Cluster 214
0       197nt, >TE_00000985#DNA/DTA... at 32:197:753:918/+/97.59%                                        
1       2782nt, >TE_00001061#DNA/DTA... *  

@oushujun
Owner

The final TElib could have some level of redundancy, but the highly redundant part should have been removed. Some sequences may share quite a bit of similarity with others but not meet the clustering threshold, so they are kept as separate sequences. You may use other clustering methods to perform extra clustering.
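
As one way to follow up on such extra clustering, the sketch below parses CD-HIT cluster output in the layout quoted earlier in this thread and lists the non-representative members that match their cluster representative at or above a chosen identity. The names `parse_clstr` and `redundant_members` and the 95% cutoff are assumptions for illustration, not part of EDTA or CD-HIT; whether such members should actually be dropped from the library is a judgment call.

```python
import re
from typing import List, NamedTuple, Optional


class Member(NamedTuple):
    cluster: int
    name: str
    length: int
    representative: bool
    identity: Optional[float]  # percent identity to the representative


# Matches lines such as:
#   0   197nt, >TE_00000985#DNA/DTA... at 32:197:753:918/+/97.59%
#   1   2782nt, >TE_00001061#DNA/DTA... *
_MEMBER = re.compile(
    r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(\*|at\s+.*?([\d.]+)%)\s*$"
)


def parse_clstr(text: str) -> List[Member]:
    """Parse CD-HIT .clstr-style text into a flat list of cluster members."""
    members: List[Member] = []
    cluster: Optional[int] = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            continue
        match = _MEMBER.match(line)
        if match is None or cluster is None:
            continue  # skip blank or unrecognized lines
        length, name, status, identity = match.groups()
        is_rep = status == "*"
        members.append(
            Member(cluster, name, int(length), is_rep,
                   None if is_rep else float(identity))
        )
    return members


def redundant_members(members: List[Member], min_identity: float = 95.0) -> List[str]:
    """Names of non-representative members at or above the identity cutoff."""
    return [
        m.name
        for m in members
        if not m.representative
        and m.identity is not None
        and m.identity >= min_identity
    ]
```

Applied to the Cluster 214 example above, this would flag the 197 nt entry TE_00000985#DNA/DTA as contained in the 2782 nt representative TE_00001061#DNA/DTA. Note that CD-HIT may truncate sequence names in its cluster file, so the reported names may need to be mapped back to the full FASTA headers.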

@oushujun closed this as completed Jul 8, 2020