Benchmark with rice, dmel, and maize

oushujun commented 5 years ago

Hello @zhangrengang,

Thank you so much for developing this neat package. Following your suggestion, I improved the LTR_retriever classification scheme by changing the Copia classification ratio to 0.9, and I refer to the result as LTR_retriever_new.

Original scheme in annotate_TE.pl:

$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.3);

New scheme in annotate_TE.pl:

$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.9);
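
The effect of the threshold change can be illustrated with a minimal Python sketch of the same rule. The function name `classify_superfamily`, its parameters, the "unknown" fallback, and the example values are all hypothetical and chosen only for illustration; the real logic is the Perl in annotate_TE.pl quoted above.

```python
def classify_superfamily(gypsy, copia, gypsy_ratio=0.3, copia_ratio=0.9):
    """Return 'Gypsy', 'Copia', or 'unknown' from two evidence values.

    Mirrors the quoted Perl rule: the minority superfamily's value,
    relative to the majority's, must stay below a cutoff.
    The 'unknown' fallback is an assumption of this sketch.
    """
    if gypsy > copia and copia / gypsy < gypsy_ratio:
        return "Gypsy"
    if copia > gypsy and gypsy / copia < copia_ratio:
        return "Copia"
    return "unknown"


# Hypothetical element with mixed evidence: gypsy=8, copia=10 (ratio 0.8).
old = classify_superfamily(8, 10, copia_ratio=0.3)  # original cutoff
new = classify_superfamily(8, 10, copia_ratio=0.9)  # relaxed cutoff
print(old, new)
```

With the original cutoff of 0.3 this mixed-evidence element stays unclassified, while the relaxed cutoff of 0.9 assigns it to Copia.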

I then benchmarked the classification performance of LTR_classifier, LTR_retriever, and LTR_retriever_new. I used the rice curated library, the dmel Repbase database, and the maize TE consortium (MTEC) library for this test.

Species	Method	Database	Total LTR	Copia	Gypsy	others	Unknown	Reclassified unknown	Misclassified as other superfamily	Misclassified as other class
Rice	Curated	Curated	409	159	224	19	7	0	0	0
Rice	LTR_classifier	gydb	308	134	172	0	0	-	2	1
Rice	LTR_classifier	rexdb	330	142	185	0	0	-	3	0
Rice	LTR_retriever	TEfam.hmm	353	69	203	0	0	-	77	0
Rice	LTR_retriever_new	TEfam.hmm	353	138	203	0	4	-	8	0
Dmel	Curated	Curated	142	10	100	17	15	0	0	0
Dmel	LTR_classifier	rexdb	67	5	43	0	0	3	10	6
Dmel	LTR_retriever_new	TEfam.hmm	65	4	41	0	1	4	15	0
Zmays	Curated	Curated	600	185	244	0	171	0	0	0
Zmays	LTR_classifier	rexdb	473	170	224	0	0	73	5	1
Zmays	LTR_retriever_new	TEfam.hmm	460	168	224	0	5	55	7	1

In rice, LTR_retriever_new has significantly improved classification sensitivity for Copia elements (thank you for your keen insight!) and has slightly higher overall sensitivity than LTR_classifier with the rexdb database. For maize and Drosophila, the two methods, LTR_classifier and LTR_retriever_new, have comparable performance. In addition, LTR_classifier provides accurate classifications for non-LTR and DNA TEs, which makes it a general TE classifier. I think you could write a short application note for this nice package. I would love to cite it and incorporate it in the EDTA package.
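
The overall comparison for rice can be checked with simple arithmetic on the table above. The sketch below divides each method's "Total LTR" count by the curated total. This ratio is only a rough proxy for sensitivity (it does not subtract misclassified entries), and the variable names are illustrative.

```python
# "Total LTR" counts for rice, copied from the benchmark table.
CURATED_TOTAL = 409
classified = {
    "LTR_classifier (gydb)": 308,
    "LTR_classifier (rexdb)": 330,
    "LTR_retriever": 353,
    "LTR_retriever_new": 353,
}

# Fraction of curated LTR entries that each method classified as LTR.
recovered = {method: n / CURATED_TOTAL for method, n in classified.items()}

for method, frac in recovered.items():
    print(f"{method}: {frac:.3f}")
```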

Thanks again for your work.

Best, Shujun

zhangrengang commented 5 years ago

Hi @oushujun, thank you for your work. I have added the REXdb metazoa_v3 database for working with metazoan species. It may be helpful, though I do not work with metazoa myself and the original REXdb paper focused on plants. As for a short application note, do you have any advice? I am not familiar with the format.

oushujun commented 5 years ago

Thanks for the update! I benchmarked the REXdb metazoa database in Drosophila:

Species	Method	Database	Total LTR	Copia	Gypsy	others	Unknown	Reclassified unknown	Misclassified as other superfamily	Misclassified as other class
Dmel	Curated	Curated	142	10	100	17	15	0	0	0
Dmel	LTR_classifier	rexdb	67	5	43	0	0	3	10	6
Dmel	LTR_classifier	rexdb-metazoa	67	5	49	8	0	4	1	0
Dmel	LTR_retriever_new	TEfam.hmm	65	4	41	0	1	4	15	0

Although the total number of classified LTR elements is the same, the metazoa database provides more accurate classifications for the Gypsy and Bel-Pao (others) superfamilies, so this is definitely an improvement. My follow-up question is: does it make sense to combine all these databases (and remove redundant entries) for classification of both plants and metazoans?

For an application note, it's rather simple. You just need one figure/table with ~1,000 words to briefly describe the improvements and demonstrate the applications. Here is an example for the LTR_FINDER_parallel package. You may follow the instructions in Bioinformatics.

Best, Shujun

zhangrengang commented 5 years ago

Shujun, I have updated the default database to REXdb viridiplantae_v3.0 + metazoa_v3 for classification of both plants and metazoans. The results might be slightly different.

Thanks for sharing. I will prepare the application note soon. I would also like to list you as a co-author for your contribution. I will add usage documentation and some wrappers to make the package more flexible and easier to use. I also plan to add a module to further classify the unclassified sequences to improve sensitivity. Do you have any other advice or requirements?

oushujun commented 5 years ago

Thank you for the update, and thank you for kindly adding me as a coauthor. Please let me know if you need anything. You may contact me via oushujun@iastate.edu.

For the package usage, maybe it would be easier to combine the two steps into one line? Also, I was trying to use it to identify TE-contained genes, and I found it rather slow for large datasets. I tried changing to hmmscan --cpu 36, but that didn't speed it up. I ended up splitting the input file into 36 pieces and running 36 jobs of the classifier, which completed in minutes. Maybe this is a way to speed up large inputs internally in the script?
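
The splitting step described above could look roughly like the following Python sketch. It is not the package's implementation; `split_fasta` and the demo sequences are hypothetical, and records are dealt out round-robin, which is just one reasonable way to balance the pieces.

```python
def split_fasta(text, n_parts):
    """Distribute FASTA records round-robin into up to n_parts text chunks."""
    records, current = [], []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))

    chunks = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        chunks[i % n_parts].append(record)
    # Drop empty chunks when there are fewer records than parts.
    return ["\n".join(chunk) + "\n" for chunk in chunks if chunk]


demo = ">a\nACGT\n>b\nGG\nCC\n>c\nTTTT\n"
parts = split_fasta(demo, 2)
print(parts)
```

Each chunk can then be written to its own file and passed to a separate classifier job, with the per-chunk outputs concatenated afterwards.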

For better compatibility, I converted the python2 code into python3 using this website: https://www.pythonconverter.com/. Nicely, it works! So you may also want to make a py3 version for users/packages only compatible with python3.

Cheers, Shujun

zhangrengang commented 5 years ago

OK, I will add these features in a few days, except for the py3 version, because I am not able to write or maintain py3 code at the moment. I will contact you later to exchange more information.

oushujun commented 5 years ago

Yes, I don't write Python in any version, but the website conversion seems to work. So when you update the original code, you can just convert it on the website and test it with python3. I assume the conversion is successful if no error message is emitted and it produces the same result.
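
One way to check "produces the same result" is to compare the output files of the two runs. The following is a minimal sketch under stated assumptions: the file names and the tab-separated demo content are made up, and line order is ignored on the assumption that it carries no meaning in the output.

```python
import os
import tempfile


def same_lines(path_a, path_b):
    """Return True if both files hold the same lines, ignoring line order."""
    with open(path_a) as fa, open(path_b) as fb:
        return sorted(fa.read().splitlines()) == sorted(fb.read().splitlines())


# Demo with two throwaway files standing in for the py2 and py3 outputs.
with tempfile.TemporaryDirectory() as tmp:
    py2_out = os.path.join(tmp, "py2.tsv")
    py3_out = os.path.join(tmp, "py3.tsv")
    with open(py2_out, "w") as fh:
        fh.write("TE1\tLTR\tCopia\nTE2\tLTR\tGypsy\n")
    with open(py3_out, "w") as fh:
        fh.write("TE2\tLTR\tGypsy\nTE1\tLTR\tCopia\n")
    identical = same_lines(py2_out, py3_out)

print(identical)
```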

zhangrengang commented 5 years ago

Shujun, I have released a new version that is easier to use and has more features. Regarding the python3 version, I mean that the converted code may be unstable or contain bugs that I cannot maintain. I think it is easier to install python2; the two versions can coexist. The main source of conflict is the Python environment variables, especially PYTHONPATH, so .pth files are recommended in place of PYTHONPATH for importing modules. If PYTHONPATH is not set, py2 and py3 generally coexist without problems. PYTHONPATH can be disabled with unset PYTHONPATH.
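
As a rough illustration of the .pth approach: a .pth file placed in an interpreter's site-packages directory lists extra directories, one per line, that Python's `site` module adds to `sys.path` at startup if they exist. Because each interpreter reads only its own site-packages, this does not leak between py2 and py3 the way PYTHONPATH does. The helper name `add_pth`, the file name, and the `/opt/my_modules` path below are hypothetical, and the sketch itself is written for Python 3.

```python
import os
import site
import tempfile


def add_pth(module_dir, pth_name="extra_modules.pth", site_dir=None):
    """Write a .pth file that lists module_dir into a site-packages directory."""
    # Default to the current interpreter's per-user site-packages directory.
    site_dir = site_dir or site.getusersitepackages()
    os.makedirs(site_dir, exist_ok=True)
    pth_path = os.path.join(site_dir, pth_name)
    with open(pth_path, "w") as fh:
        fh.write(os.path.abspath(module_dir) + "\n")
    return pth_path


# Demo against a throwaway directory so no real site-packages is modified.
with tempfile.TemporaryDirectory() as tmp:
    pth = add_pth("/opt/my_modules", site_dir=tmp)
    with open(pth) as fh:
        pth_content = fh.read()

print(pth_content)
```

To target a python2 installation, `site_dir` would be pointed at that interpreter's own site-packages directory.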

oushujun commented 5 years ago

Thanks for the new features and documentation! I have a situation where python3 was installed in a conda env. When I specify to use python2 in this environment, the default /usr/bin/python2 is recruited. However, this python2 is root restricted and I cannot install python parallel and biopython on it. Installing both python2 and python3 in the same conda environment seems impossible. For this case I may have to convert the python2 code myself and use at my own risk...

For classification of LTR retrotransposons, do you only use superfamily-specific hmms or also use the order of polyproteins?

Best, Shujun

zhangrengang commented 5 years ago

There are several ways for you to use python2:

Install python2 modules in your user directory:
```
pip install pp --user
pip install biopython --user
```
In my environment, they will be installed in ~/.local/lib/python2.7/site-packages. Then set PYTHONPATH:
```
export PYTHONPATH=$PYTHONPATH:$HOME/.local/lib/python2.7/site-packages
```
It may conflict with python3; just unset or reset PYTHONPATH when using python3.

Install your own python2:
```
# after downloading and decompressing the python2 source code
./configure --prefix ~/.local
make && make install
```
If this succeeds, python2 will be installed in ~/.local/bin. Then set PATH:
```
export PATH=$HOME/.local/bin:$PATH
which python
```
Then install python2 modules with the newly installed python:
```
pip install pp
pip install biopython
```
They should be installed under the python2 install directory: ~/.local/lib/python2.7/site-packages. If pip does not work, install pip:
```
easy_install pip
```

Install a new python2 environment with conda:

conda create -n python27 python=2.7 anaconda
conda activate python27
which python
pip install pp
pip install biopython
conda deactivate

I have not tested this approach, but it should work in theory.

For classification of LTR retrotransposons, I only use clade-specific HMMs. You can list them with:

grep NAME database/REXdb_protein_database_viridiplantae_v3.0.hmm
grep NAME database/REXdb_protein_database_metazoa_v3.hmm

The two databases differ, but I followed their nomenclature.
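
For anyone who prefers to inspect the profile names from Python, the grep commands above can be approximated as follows. This is only a sketch: `hmm_names` is a hypothetical helper, the sample text uses made-up profile names rather than real REXdb entries, and it matches only lines whose first field is NAME (slightly stricter than grep).

```python
def hmm_names(hmm_text):
    """Return profile names from HMMER3 text (lines whose first field is NAME)."""
    names = []
    for line in hmm_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "NAME":
            names.append(fields[1].strip())
    return names


# Made-up two-profile excerpt; real database files contain many more fields.
sample = """HMMER3/f [3.1b2 | February 2015]
NAME  example_clade_A:RT
LENG  120
//
HMMER3/f [3.1b2 | February 2015]
NAME  example_clade_B:INT
LENG  95
//
"""

print(hmm_names(sample))
```

For a real database, the file contents would be read with `open(path).read()` and passed to the same function.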

oushujun commented 5 years ago

Thanks for your detailed guide. Here I want to update the usage of TEsorter under Python3 conda environments:

Install dependencies under python2 --user: python2 -m pip install --user numpy==1.14.3 biopython pp

Call TEsorter with python2: python2 ../TEsorter.py rice6.9.5.liban -p 36

zhangrengang / TEsorter

Benchmark with rice, dmel, and maize #2