Closed oushujun closed 5 years ago
Hi, @oushujun, thanks for your work. I have added the REXdb metazoa_v3 database for working with metazoan species. It may be helpful, though I do not work with metazoa myself and the original REXdb paper focused on plants. About a short application note, do you have any advice? I am not familiar with the format.
Thanks for the update! I benchmarked the REXdb metazoa database in Drosophila:
Species | Method | Database | Total LTR | Copia | Gypsy | others | Unknown | Reclassified unknown | Misclassified as other superfamily | Misclassified as other class |
---|---|---|---|---|---|---|---|---|---|---|
Dmel | Curated | Curated | 142 | 10 | 100 | 17 | 15 | 0 | 0 | 0 |
Dmel | LTR_classifier | rexdb | 67 | 5 | 43 | 0 | 0 | 3 | 10 | 6 |
Dmel | LTR_classifier | rexdb-metazoa | 67 | 5 | 49 | 8 | 0 | 4 | 1 | 0 |
Dmel | LTR_retriever_new | TEfam.hmm | 65 | 4 | 41 | 0 | 1 | 4 | 15 | 0 |
Although the total number of classified LTR elements is the same, the metazoa database provides more accurate classifications for the `Gypsy` and `Bel-Pao` (others) superfamilies. So this is definitely an improvement. My follow-up question is: does it make sense to combine all these databases (and remove redundant entries) for classification of both plants and metazoans?
For an application note, it's rather simple. You just need one figure/table with ~1,000 words to briefly describe the improvements and demonstrate the applications. Here is an example for the LTR_FINDER_parallel package. You may follow the instructions in Bioinformatics.
Best, Shujun
Shujun, I have updated the default database to REXdb viridiplantae_v3.0 + metazoa_v3 for classification of both plants and metazoans. The results might be slightly different.
Thanks for sharing. I will prepare the application note soon. I would also like to list you as a co-author for your contribution. I will add a usage message and some wrappers to make the package more flexible and easier to use. I also plan to add a module that further classifies the unclassified sequences to improve sensitivity. Do you have any other advice or requirements?
Thank you for the update, and thank you for kindly adding me as a coauthor. Please let me know if you need anything. You may contact me via oushujun@iastate.edu.
For the package usage, maybe it will be easier to combine the two steps into one line? Also, I was trying to use it to identify TE-contained genes, and I found it's rather slow for large datasets. I tried switching to `hmmscan --cpu 36`, but it didn't speed things up. I ended up splitting the input file into 36 pieces and running 36 classifier jobs, which completed in minutes. Maybe this is a way to speed up large inputs internally in the script?
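The splitting step described above could be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: `read_fasta`, `split_fasta`, and the `chunk_NNN.fa` file naming are hypothetical, and it assumes the input is a plain FASTA file. Each chunk can then be classified in its own job and the outputs concatenated afterwards.

```python
import os


def read_fasta(path):
    """Yield (header_line, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif header is not None:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, n_chunks, out_dir):
    """Distribute records round-robin into n_chunks files; return their paths.

    File names (chunk_000.fa, ...) are an assumption made for this sketch.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    paths = [os.path.join(out_dir, "chunk_%03d.fa" % i) for i in range(n_chunks)]
    handles = [open(p, "w") for p in paths]
    try:
        for i, (header, lines) in enumerate(read_fasta(path)):
            out = handles[i % n_chunks]
            out.write(header)
            out.writelines(lines)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Round-robin assignment keeps the chunks at a similar record count without a first pass to count sequences; it does not balance total sequence length, which may matter if sequence sizes vary widely.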
For better compatibility, I converted the python2 code into python3 using this website: https://www.pythonconverter.com/. Nicely, it works! So you may also want to make a py3 version for users/packages only compatible with python3.
Cheers, Shujun
OK, I will add these features in a few days, except for the py3 version, because I am not able to write or maintain py3 code at the moment. I will contact you later to exchange more information.
Yes, I don't write Python in any version, but the website conversion seems to work. So when you update the original code, you may just convert it on the website and test it with python3. I assume the conversion is successful if no error message is emitted and it produces the same result.
Shujun, I have released a new version that is easier to use and has more features.
About the python3 version, I mean that the converted code may be unstable or contain bugs that I cannot maintain. I think it is easier to install python2. The two can co-exist without conflicts. They may conflict through the Python environment variables, especially `PYTHONPATH`, so `.pth` files are recommended instead of `PYTHONPATH` for importing modules. If `PYTHONPATH` is not set, py2 and py3 generally coexist without problems. `PYTHONPATH` can be disabled with `unset PYTHONPATH`.
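As an illustration of the `.pth` approach: a `.pth` file placed in a site-packages directory lists one directory per line, and the interpreter that owns that site-packages adds each existing directory to `sys.path` at startup. The helper below is a hypothetical sketch (the name `write_pth` and its arguments are not part of any package) that writes such a file.

```python
import os


def write_pth(site_dir, pth_name, module_dirs):
    """Write a .pth file into site_dir listing one absolute directory per line.

    site_dir should be a site-packages directory of the target interpreter,
    for example the value returned by site.getusersitepackages().
    """
    pth_path = os.path.join(site_dir, pth_name)
    with open(pth_path, "w") as handle:
        for directory in module_dirs:
            handle.write(os.path.abspath(os.path.expanduser(directory)) + "\n")
    return pth_path


# Example call (paths are placeholders, not taken from the thread):
#   import site
#   write_pth(site.getusersitepackages(), "extra_modules.pth", ["~/tools/mylib"])
```

Because each interpreter reads only its own site-packages, a `.pth` file written for python2 does not leak into python3, which is the conflict that a shared `PYTHONPATH` can cause.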
Thanks for the new features and documentation! I have a situation where python3 was installed in a conda env. When I specify python2 in this environment, the default `/usr/bin/python2` is used. However, this python2 is root-restricted and I cannot install python `parallel` and `biopython` on it. Installing both python2 and python3 in the same conda environment seems impossible. For this case I may have to convert the python2 code myself and use it at my own risk...
For classification of LTR retrotransposons, do you only use superfamily-specific hmms or also use the order of polyproteins?
Best, Shujun
There are several ways for you to use python2:
pip install pp --user
pip install biopython --user
In my environment, they are installed in `~/.local/lib/python2.7/site-packages`. Then set `PYTHONPATH`:
export PYTHONPATH=$PYTHONPATH:$HOME/.local/lib/python2.7/site-packages
This may conflict with python3; just `unset PYTHONPATH` or reset `PYTHONPATH` when using python3.
# or compile python2 yourself: after downloading and decompressing the python2 source code,
./configure --prefix ~/.local
make && make install
If this succeeds, python2 will be installed in `~/.local/bin`. Then set `PATH`:
export PATH=$HOME/.local/bin:$PATH
which python
Then install python2 modules with the newly installed python:
pip install pp
pip install biopython
They should be installed under the python2 install directory: `~/.local/lib/python2.7/site-packages`.
If `pip` does not work, install pip:
easy_install pip
# or create a separate python2 conda environment
conda create -n python27 python=2.7 anaconda
conda activate python27
which python
pip install pp
pip install biopython
conda deactivate
I have not tested the conda approach, but it should work in theory.
For classification of LTR retrotransposons, I only use clade-specific HMMs. You can list them with:
grep NAME database/REXdb_protein_database_viridiplantae_v3.0.hmm
grep NAME database/REXdb_protein_database_metazoa_v3.hmm
The two databases differ, but I followed their nomenclature.
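The same listing can be done from Python if you want the clade names as a list rather than as `grep` output. This is a small sketch with a hypothetical helper name (`hmm_names`); it assumes the HMMER3 text format, in which each profile's name is stored on a line beginning with `NAME`.

```python
def hmm_names(path):
    """Return the model names from the NAME lines of a HMMER3 text profile file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("NAME"):
                fields = line.split(None, 1)  # split tag from value on whitespace
                if len(fields) == 2:
                    names.append(fields[1].strip())
    return names


# Example call, using a path mentioned in this thread:
#   names = hmm_names("database/REXdb_protein_database_metazoa_v3.hmm")
```

Unlike `grep NAME`, this only matches lines that start with `NAME`, so a model description that happens to contain the word is not picked up.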
Thanks for your detailed guide. Here I want to update the usage of TEsorter under Python3 conda environments:
Install dependencies under `python2 --user`:
python2 -m pip install --user numpy==1.14.3 biopython pp
Call TEsorter with `python2`:
python2 ../TEsorter.py rice6.9.5.liban -p 36
Hello @zhangrengang,
Thank you so much for developing this neat package. I improved the `LTR_retriever` classification scheme based on your suggestions by changing the copia classification ratio to 0.9, and labeled the result as `LTR_retriever_new`.
Original scheme in annotate_TE.pl:
New scheme in annotate_TE.pl:
I then benchmarked the classification performance of `LTR_classifier`, `LTR_retriever`, and `LTR_retriever_new`. I used the rice curated library, the dmel repbase database, and the maize TE consortium (MTEC) library for this test.
In rice, `LTR_retriever_new` has significantly improved classification sensitivity for `copia` elements (thank you for your keen insight!) and has slightly higher overall sensitivity than `LTR_classifier` with the `rexdb` database. For maize and drosophila, the two methods, `LTR_classifier` and `LTR_retriever_new`, have comparable performance. Besides, `LTR_classifier` provides accurate classifications for non-LTR and DNA TEs, which makes it a general TE classifier. I think you can write a short application note for this nice package. I would love to cite it and incorporate it in the EDTA package.
Thanks again for your work.
Best, Shujun