zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0

Benchmark with rice, dmel, and maize #2

Closed oushujun closed 4 years ago

oushujun commented 4 years ago

Hello @zhangrengang,

Thank you so much for developing this neat package. Based on your suggestions, I improved the LTR_retriever classification scheme by changing the copia classification ratio to 0.9, and labeled the result LTR_retriever_new.

Original scheme in annotate_TE.pl:

$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.3);

New scheme in annotate_TE.pl:

$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.9);
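To make the effect of the threshold change concrete, here is a minimal Python sketch of the same rule. It is not code from LTR_retriever; the function name, the parameter names, and the `None` return value for the undecided case are assumptions made for illustration, and `gypsy` and `copia` stand for whatever scores annotate_TE.pl accumulates in `$gypsy` and `$copia`.

```python
def assign_family(gypsy, copia, gypsy_max_ratio=0.3, copia_max_ratio=0.9):
    """Mirror the two Perl statements above; return 'Gypsy', 'Copia', or None.

    None is a placeholder for "not decided by this rule" (an assumption;
    the original script may use a different default).
    """
    family = None
    # Gypsy wins only if copia support is below gypsy_max_ratio of gypsy support.
    if gypsy > copia and copia / float(gypsy) < gypsy_max_ratio:
        family = "Gypsy"
    # Copia wins if gypsy support is below copia_max_ratio of copia support.
    if copia > gypsy and gypsy / float(copia) < copia_max_ratio:
        family = "Copia"
    return family


# An element with copia=10 and gypsy=5 has a gypsy/copia ratio of 0.5.
print(assign_family(5, 10, copia_max_ratio=0.3))  # original copia threshold
print(assign_family(5, 10, copia_max_ratio=0.9))  # new copia threshold
```

With the original threshold of 0.3 such an element is left undecided, while the relaxed threshold of 0.9 lets it be called Copia. The strict `>` comparisons guarantee a non-zero denominator before each division.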

I then benchmarked the classification performance of LTR_classifier, LTR_retriever, and LTR_retriever_new. I used the rice curated library, the dmel repbase database, and the maize TE consortium (MTEC) library for this test.

| Species | Method | Database | Total LTR | Copia | Gypsy | Others | Unknown | Reclassified unknown | Misclassified as other superfamily | Misclassified as other class |
|---|---|---|---|---|---|---|---|---|---|---|
| Rice | Curated | Curated | 409 | 159 | 224 | 19 | 7 | 0 | 0 | 0 |
| Rice | LTR_classifier | gydb | 308 | 134 | 172 | 0 | 0 | - | 2 | 1 |
| Rice | LTR_classifier | rexdb | 330 | 142 | 185 | 0 | 0 | - | 3 | 0 |
| Rice | LTR_retriever | TEfam.hmm | 353 | 69 | 203 | 0 | 0 | - | 77 | 0 |
| Rice | LTR_retriever_new | TEfam.hmm | 353 | 138 | 203 | 0 | 4 | - | 8 | 0 |
| Dmel | Curated | Curated | 142 | 10 | 100 | 17 | 15 | 0 | 0 | 0 |
| Dmel | LTR_classifier | rexdb | 67 | 5 | 43 | 0 | 0 | 3 | 10 | 6 |
| Dmel | LTR_retriever_new | TEfam.hmm | 65 | 4 | 41 | 0 | 1 | 4 | 15 | 0 |
| Zmays | Curated | Curated | 600 | 185 | 244 | 0 | 171 | 0 | 0 | 0 |
| Zmays | LTR_classifier | rexdb | 473 | 170 | 224 | 0 | 0 | 73 | 5 | 1 |
| Zmays | LTR_retriever_new | TEfam.hmm | 460 | 168 | 224 | 0 | 5 | 55 | 7 | 1 |

In rice, LTR_retriever_new has significantly improved classification sensitivity for copia elements (thank you for your keen insight!) and has slightly higher overall sensitivity than LTR_classifier with the rexdb database. For maize and drosophila, the two methods, LTR_classifier and LTR_retriever_new, have comparable performance. In addition, LTR_classifier provides accurate classifications for non-LTR and DNA TEs, which makes it a general TE classifier. I think you could write a short application note for this nice package. I would love to cite it and incorporate it in the EDTA package.
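The rice comparison can be checked with a few lines of arithmetic. The thread does not state how sensitivity was computed, so this sketch assumes it is approximated by each method's count divided by the corresponding curated count; the helper name `fraction` is hypothetical, and the numbers are copied from the rice rows of the table above.

```python
def fraction(classified, curated_total):
    """Share of curated elements recovered (assumed proxy for sensitivity)."""
    return classified / float(curated_total)


# "Total LTR" column for rice; the curated library has 409 LTR elements.
rice_curated_ltr = 409
rice_total_ltr = {
    "LTR_classifier (gydb)": 308,
    "LTR_classifier (rexdb)": 330,
    "LTR_retriever (TEfam.hmm)": 353,
    "LTR_retriever_new (TEfam.hmm)": 353,
}
for method, total in sorted(rice_total_ltr.items()):
    print("%-32s %.3f" % (method, fraction(total, rice_curated_ltr)))

# "Copia" column for rice; the curated library has 159 Copia elements.
print("Copia, LTR_retriever     %.3f" % fraction(69, 159))
print("Copia, LTR_retriever_new %.3f" % fraction(138, 159))
```

Under this assumption the copia figure doubles between the two LTR_retriever schemes, which matches the improvement described above.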

Thanks again for your work.

Best, Shujun

zhangrengang commented 4 years ago

Hi @oushujun, thank you for your work. I have added the REXdb metazoa_v3 database for working with metazoan species. It may be helpful, although I do not work with metazoa myself and the original REXdb paper focused on plants. As for a short application note, do you have any advice? I am not familiar with that format.

oushujun commented 4 years ago

Thanks for the update! I benchmarked the REXdb metazoa database in Drosophila:

| Species | Method | Database | Total LTR | Copia | Gypsy | Others | Unknown | Reclassified unknown | Misclassified as other superfamily | Misclassified as other class |
|---|---|---|---|---|---|---|---|---|---|---|
| Dmel | Curated | Curated | 142 | 10 | 100 | 17 | 15 | 0 | 0 | 0 |
| Dmel | LTR_classifier | rexdb | 67 | 5 | 43 | 0 | 0 | 3 | 10 | 6 |
| Dmel | LTR_classifier | rexdb-metazoa | 67 | 5 | 49 | 8 | 0 | 4 | 1 | 0 |
| Dmel | LTR_retriever_new | TEfam.hmm | 65 | 4 | 41 | 0 | 1 | 4 | 15 | 0 |

Although the total number of classified LTR elements is the same, the metazoa database provides more accurate classifications for the Gypsy and Bel-Pao (others) superfamilies, so this is definitely an improvement. My follow-up question is: does it make sense to combine all these databases (and remove redundant entries) for classification of both plants and metazoans?

For an application note, it's rather simple. You just need one figure/table with ~1,000 words to briefly describe the improvements and demonstrate the applications. Here is an example for the LTR_FINDER_parallel package. You may follow the instructions in Bioinformatics.

Best, Shujun

zhangrengang commented 4 years ago

Shujun, I have changed the default database to REXdb viridiplantae_v3.0 + metazoa_v3 for classification of both plants and metazoans. The results might be slightly different.

Thanks for sharing. I will prepare the application note soon. I would also like to list you as a co-author for your contribution. I will add usage documentation and some wrappers to make the package more flexible and easier to use. I also plan to add a module that further classifies the unclassified sequences to improve sensitivity. Do you have any other advice or requirements?

oushujun commented 4 years ago

Thank you for the update, and thank you for kindly adding me as a coauthor. Please let me know if you need anything. You may contact me via oushujun@iastate.edu.

For the package usage, maybe it would be easier to combine the two steps into one command? Also, I was trying to use it to identify TE-containing genes, and I found it rather slow for large datasets. I tried changing to hmmscan --cpu 36, but it didn't speed things up. I ended up splitting the input file into 36 pieces and running 36 classifier jobs, which completed in minutes. Maybe this is a way to speed up large inputs internally in the script?
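The split-and-run-in-parallel workaround can be sketched as follows. This is only an illustration of the splitting step, not TEsorter code: the function name, the output naming scheme, and the round-robin assignment of records to chunks are all assumptions.

```python
def split_fasta(fasta_path, n_chunks, out_prefix):
    """Distribute FASTA records round-robin over n_chunks files.

    Returns the list of chunk paths (hypothetical naming: <prefix>.<i>.fa).
    """
    paths = ["%s.%d.fa" % (out_prefix, i) for i in range(n_chunks)]
    handles = [open(path, "w") for path in paths]
    try:
        record_index = -1
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    record_index += 1  # a header line starts a new record
                if record_index >= 0:  # ignore anything before the first header
                    handles[record_index % n_chunks].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Each chunk can then be given to its own classifier job and the per-chunk outputs concatenated afterwards. Round-robin assignment keeps the number of records per chunk balanced without a first pass to count them, although it does not balance total sequence length.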

For better compatibility, I converted the python2 code into python3 using this website: https://www.pythonconverter.com/. Nicely, it works! So you may also want to make a py3 version for users/packages only compatible with python3.

Cheers, Shujun

zhangrengang commented 4 years ago

OK, I will add these features in a few days, except for the py3 version, because I am not able to write or maintain py3 code at the moment. I will contact you later to exchange more information.

oushujun commented 4 years ago

Yes, I don't write Python in any version, but the website conversion seems to work. So when you update the original code, you can just convert it on the website and test it with python3. I assume the conversion is successful if no error message is emitted and it produces the same result.
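The "same result" check can be automated with a small comparison. This is a minimal sketch under stated assumptions: the output is a line-oriented text file, and two runs count as equivalent when they contain the same lines regardless of order. The function name and the file names in the usage comment are hypothetical.

```python
def same_records(path_a, path_b):
    """Return True if both text files contain the same lines, in any order."""
    with open(path_a) as file_a:
        lines_a = sorted(file_a)
    with open(path_b) as file_b:
        lines_b = sorted(file_b)
    return lines_a == lines_b


# Hypothetical usage, comparing one output table from each interpreter:
# same_records("py2_run/classification.tsv", "py3_run/classification.tsv")
```

Sorting before comparing avoids false alarms when the two interpreters iterate over dictionaries or sets in a different order; if record order matters, compare the unsorted lines instead.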

zhangrengang commented 4 years ago

Shujun, I have released a new version that is easier to use and has more features. About the python3 version, I mean that the converted code may be unstable or contain bugs that I cannot maintain. I think it is easier to install python2. The two versions can coexist; the only likely conflict is through Python environment variables, especially PYTHONPATH, so .pth files are recommended instead of PYTHONPATH for importing modules. If PYTHONPATH is not set, py2 and py3 generally coexist without problems. PYTHONPATH can be disabled with unset PYTHONPATH.
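A `.pth` file is a plain text file placed in a site-packages directory; each line that names an existing directory is appended to `sys.path` at interpreter startup. Because every Python version has its own site-packages directory, the extra paths apply only to that interpreter, unlike PYTHONPATH. The sketch below writes such a file; the function name and the default file name are hypothetical, and it assumes the per-user site directory is enabled.

```python
import os
import site


def write_pth(paths, name="extra_modules.pth", site_dir=None):
    """Write one directory per line into <site_dir>/<name> and return its path.

    By default the per-user site-packages directory of the running
    interpreter is used (e.g. ~/.local/lib/python2.7/site-packages).
    """
    if site_dir is None:
        site_dir = site.getusersitepackages()
    if not os.path.isdir(site_dir):
        os.makedirs(site_dir)
    pth_path = os.path.join(site_dir, name)
    with open(pth_path, "w") as handle:
        for path in paths:
            handle.write(os.path.abspath(path) + "\n")
    return pth_path
```

Run it once with the interpreter that should see the extra directories; the other Python version is unaffected.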

oushujun commented 4 years ago

Thanks for the new features and documentation! I have a situation where python3 is installed in a conda env. When I ask for python2 in this environment, the system default /usr/bin/python2 is used. However, this python2 is root restricted and I cannot install Parallel Python (pp) and biopython for it. Installing both python2 and python3 in the same conda environment seems impossible. In this case I may have to convert the python2 code myself and use it at my own risk...

For classification of LTR retrotransposons, do you only use superfamily-specific hmms or also use the order of polyproteins?

Best, Shujun

zhangrengang commented 4 years ago

There are several ways for you to use python2:

  1. install python2 modules in your user directory:
    pip install pp --user
    pip install biopython --user

    In my environment, they will be installed in ~/.local/lib/python2.7/site-packages. Then set the PYTHONPATH:

    export PYTHONPATH=$PYTHONPATH:$HOME/.local/lib/python2.7/site-packages

    It may conflict with python3; just unset or reset PYTHONPATH when using python3.

  2. install your own python2:
    # after downloading and unpacking the python2 source code,
    ./configure --prefix ~/.local
    make && make install

    If this succeeds, python2 will be installed in ~/.local/bin. Then set PATH:

    export PATH=$HOME/.local/bin:$PATH
    which python

    Then install python2 modules with the newly installed python:

    pip install pp
    pip install biopython

    They should be installed under the python2 install directory: ~/.local/lib/python2.7/site-packages. If pip does not work, install pip:

    easy_install pip
  3. install a new python2 environment with conda:
    conda create -n python27 python=2.7 anaconda
    conda activate python27
    which python
    pip install pp
    pip install biopython
    conda deactivate

    I have not tested this approach, but it should work in principle.
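Whichever of the three routes is used, it helps to confirm that the interpreter being called can actually import the dependencies. The following is a minimal sketch that runs under both python2 and python3; the function name is hypothetical, while `Bio` and `pp` are the import names of biopython and Parallel Python.

```python
import sys


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Example check, to be run with the interpreter intended for the classifier:
# print(sys.executable, sys.version_info[:2], missing_modules(["Bio", "pp"]))
```

An empty list means both modules were found; printing `sys.executable` alongside shows which python is really being used, which is the point of the `which python` steps above.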

For classification of LTR retrotransposons, I only use clade-specific HMMs. You can list them with:

grep NAME database/REXdb_protein_database_viridiplantae_v3.0.hmm
grep NAME database/REXdb_protein_database_metazoa_v3.hmm

The two databases are different, but I followed their nomenclature.
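The same listing can be done in Python when the names are needed programmatically. This sketch assumes the files are in the HMMER3 text format, where each profile has a line whose first field is `NAME`; the function name is hypothetical. Unlike `grep NAME`, it ignores lines that merely contain the string elsewhere.

```python
def hmm_names(hmm_path):
    """Collect the value of every 'NAME' line in a HMMER3 text profile file."""
    names = []
    with open(hmm_path) as handle:
        for line in handle:
            fields = line.split(None, 1)  # split off the first field only
            if len(fields) == 2 and fields[0] == "NAME":
                names.append(fields[1].strip())
    return names


# Hypothetical usage with one of the database files mentioned above:
# for name in hmm_names("database/REXdb_protein_database_viridiplantae_v3.0.hmm"):
#     print(name)
```

Comparing the two resulting lists is a quick way to see where the plant and metazoan nomenclatures differ.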

oushujun commented 4 years ago

Thanks for your detailed guide. Here I want to update the usage of TEsorter under Python3 conda environments:

Install dependencies under python2 with --user:

python2 -m pip install --user numpy==1.14.3 biopython pp

Call TEsorter with python2:

python2 ../TEsorter.py rice6.9.5.liban -p 36