xiezhq / ISEScan

A python pipeline to identify IS (Insertion Sequence) elements in genome and metagenome
Apache License 2.0

Overly large "transposases" #45

Closed clb21565 closed 2 years ago

clb21565 commented 2 years ago

Hi there, this is a lovely tool. I am noticing however that the .faa file produced appears to be translating more than just transposases - including sometimes up to 4+ genes (e.g., see below). I guess I am wondering how to interpret this -- any guidance would be much appreciated.

image

xiezhq commented 2 years ago

Hi clb21565,

I don't know what your question is. Could you give more details about what the picture shows? How did you get the search results in the picture?

Xie

clb21565 commented 2 years ago

Thanks! This image is the result of searching one of the entries in the orf.faa file against NCBI NR. The protein is conspicuously large, see here:

unnamed protein product KKLLIKWATGFLNRCGYVVEPKAKPMDNVPAFIRQMEGNKMSREKLNTTQRALRILKALKGRSLTGLTNK ELCEAIGETPVNVTRAIAFLEAEGFXATVKHWGVWFELSDFADCSKPRDGNAESLGTIGTGASKGASGCF LIKICNQENRTMSELTLSQEQNAVALAAKAMTQDLAEAHEAMGMIKAFTFVGKLATVATLKKLAEVKEAR NYKGLQYVNADGELATVASWEEFCTACGTSRRKVDEDLQNLNQFGEEFMETSQRLGLGYREMRKLRQLPE EARAEIVDADYSETTDKEDLIEKIEDLTAKHAKEKESLTKQLESVKANYDAQAKVIANKDERLNKLDKEL AKKTLLIETQTPDQRGGMLREEAAQISYKAEAILRGQVFQAFEALQTHQEEHGIDHRQFMSGVLAEYQLI LSELKERFNLTDEPTGDNLPEWAKPEYADKPSVEPSIAAILDEVSDAQIWSKDDAMAILPSVLSHWANRV ETAKFGETEKVIDEGCKQTGLSRATFLRQIKPYRPKSNRKVRSDKGKHQLEKAELDLISAAWLHLRQKNG KTMATLERVLDILRANHRIKAEFIDENTGEVRPYSATSVERALRNANLHPDQLLRPAPVVQLQSKHPNHV WQIDRHCVLYYLKETGKGNGLCIMEEGEFYKNKPANVAKVEPQRVWRYVITDHTSGVIYVEYVYGGETAE NVSQCFINAIQPKANKAEPFFGVPKILMFDRGTANTSQMFSHLLHQLDVKVEIPKAKNARAKGQVEKGND IVERQFESGLRFMNVSGLDELNQLAHQWMRYFNGKMVHSRHGRTRYQMWQFIRPEQLIMPPSREICQELM ITALSERVVSDKLEISFESRRYDVRDVPDAKVGEKITVGKNPYRPECVQVQCFERVVDEDGSENLKPYWV VVEPVEVNEYGFRVDAAMIGEEYKAHKKTEFETHKEQAEQLAYGVTNEDDLKRAKKVNKPLFNGEINPYK HIEETNLNWFVPKKGQDHELTTNARRVEQKPVNLVECAKQLKERFPEWNGKHYKNLAKHFSEGVPITTLE DWLQGNKLPEVLNPETKILQLNAPNFDKWRFYVLKLKQVLIDKGVSLRQLAQQMNVSPATVSQLINHNQR VKQWVEFEKNLGSALQSLGIIEPLASLLEMEGTGESLATEPVPSAPKTTDEIKDEIMLLAKQALFPATKK HFGLFRDPFAEDVRSADDVFSSADVRYVREALFQTAKHGGFMAVVGESGAGKSTLRRDLIDRINQENAPI TVIEPYIIAMEDNDVKGKTLKAAHIAEAIISTLSPLESVKRSPEARFRQLHKVLKESVKSGYSNVLIIEE AHALPIPTLKHLKRFFELEDGFKKLLSIVLIGQPELKIKLSERNTEVREVVQRCEIVELAPLDAELERYV EHKLERVGKKLSDIFEEDAFVAVRQRLTAVGRNKTSQSLLYPLAVGNLLTAAMNLAESLGIPKVNGQVVM MCKKVLGSITRANEEAFANFCYDFIKLVINSPEVIVSALIYGIEQFDLVEDENGKKSIEVKLYDKEKENG EESSTIKADVHELNLQTADDVALAIKEIGDLERERVRLATLQADEKAVIDEKYTAKLTALKDKVKPLQKA VQAYCESRRDVLTNGGKQKTAYFPTGEVQWRVKPPAVVAKGLESILDSLRKLGLFRFIRTKEELDKEAML KEPEIARSISGISIREGVEEFVIKPNDXGGAKMTPSAKTERQFMYKEKAEAAARCEQLGNYQQAYNLWCE AMKLATTEKQKNGVALEQIIVILGKASGACEMIDSLEQLKMQLQQAVRQLEQAEKAIDENELPLAQCYVF TAKNLIMKLGLKMT

This is what was returned using ISEScan (default settings). Searching NCBI for this sequence using blastp, I got alignments to multiple proteins, suggesting that this is somehow a fused open reading frame spanning multiple coding regions (as in the picture above). Many of the orf.faa files have proteins like this, which are definitely erroneous. Rerunning Prodigal on the IS fna files produces multiple ORFs.
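One way to triage such entries is to list every protein in a .faa file that exceeds a length cutoff. The following is a minimal sketch, not part of ISEScan: the function names are hypothetical, and the 1000-residue default is an arbitrary assumption that should be tuned to the transposase families of interest.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {header (without '>'): sequence}."""
    parts: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)
    return {h: "".join(chunks) for h, chunks in parts.items()}


def flag_long_proteins(records: Dict[str, str],
                       max_len: int = 1000) -> List[Tuple[str, int]]:
    """Return (header, length) for proteins longer than max_len, longest first.

    max_len is an assumed cutoff for illustration, not a value used by ISEScan.
    """
    hits = [(h, len(seq)) for h, seq in records.items() if len(seq) > max_len]
    return sorted(hits, key=lambda item: -item[1])
```

To use it on a real file, pass an open file handle to `read_fasta` (for example `read_fasta(open(path))`, where `path` points at one of the .faa files) and inspect the flagged headers by hand.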

Hope this clears things up, but let me know if I can provide additional information. I appreciate the fast response.

Connor

xiezhq commented 2 years ago

You are right. FragGeneScan, which ISEScan uses to predict genes/proteins, is a good tool for dealing with frameshift issues, but its predictions sometimes differ substantially from those reported by other gene prediction tools (and are sometimes incorrect, as in your case). ISEScan refines the boundaries of predicted IS element copies when searching for them, especially for multi-copy IS elements, but it does not change the boundaries of the predicted transposases. So there may be a small number of cases where the transposase reported by ISEScan is larger than the corresponding IS element copy reported by ISEScan. The best solution is to feed ISEScan accurate gene/protein sequences instead of relying on FragGeneScan or any other single gene prediction tool. ISEScan does not support this yet, but I will probably add a feature in the future that lets users supply gene/protein sequences for their genomes.

Xie

clb21565 commented 2 years ago

Hi there, another follow-up here. We have noticed that many of the predicted IS elements are also much too large, for instance some on the order of ~50 kbp. Is there a reason why this is happening, and would you have advice on how to fix it?

xiezhq commented 2 years ago

One reason might be:

FragGeneScan, as used by ISEScan, predicts an inaccurate gene (later classified as a transposase by ISEScan because it is hit by a transposase model) which either fully covers a real IS element or largely overlaps with one. ISEScan always tries to extend the predicted transposase until it locates TIR sequences at the left and right ends. In such a case, ISEScan might not be able to find the real TIR sequences of the real IS element, because they lie within the predicted gene/transposase. Instead, it could find fake TIR sequences outside the predicted gene/transposase, producing a much larger IS element with incorrect boundaries, because it is relatively easy to find two SHORT inverted repeat sequences in a larger space (a longer stretch of DNA).
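As a rough illustration of why short inverted repeats turn up by chance in a larger search space, here is a back-of-envelope sketch. It is not ISEScan's TIR search: it assumes uniformly random, independent bases, exact matches only, and two flanking windows of equal length, and the function name is hypothetical.

```python
def expected_chance_inverted_repeats(window: int, k: int) -> float:
    """Expected number of exact inverted-repeat pairs of length k found by
    chance between two random flanking windows of `window` bases each.

    Assumed model: each of the (window - k + 1) k-mers in the left window is
    compared with each of the (window - k + 1) k-mers in the right window,
    and any given pair is an exact reverse complement with probability 4**-k.
    """
    if window < k:
        return 0.0
    positions = window - k + 1
    return positions * positions / 4 ** k
```

Under this model the expected count grows roughly with the square of the window size, so widening the region searched around an overly long predicted gene quickly makes a spurious short repeat pair more likely.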

There is no perfect (automated) way to fix this until ISEScan is upgraded with a feature that lets the user provide accurate gene sequences (actually the translated protein sequences). For now, the only way to fix it is to replace the ISEScan-predicted gene/protein sequences with your correct gene/protein sequences and re-run ISEScan:

  1. Find the incorrectly predicted IS element (e.g. an IS element that you think is far too large) in the ISEScan predictions.
  2. Copy the nucleic acid sequence of the predicted IS element, then use other tools or a BLAST search to obtain the correct (or at least what you believe to be the correct) gene sequence and protein sequence.
  3. Once you have the correct gene/protein sequence from step 2, find the proteome predicted/translated from the DNA sequence(s) in your FASTA file (your genome): the protein sequences in the *.faa file in directory results/proteome, e.g. NC_012624.fna.faa. In the .faa file, find all protein sequences whose genes overlap the incorrectly predicted IS element and replace them with the correct protein sequences. You also need to update the start and end positions of the corresponding genes, which form the last part of each gene/protein description line (the line starting with '>'). For example, in the description line >gi|228288719|ref|NC_012624.1|_995_2377_+, the suffix 995_2377_+ means the gene is on the + strand, starting at 995 and ending at 2377.
  4. After you have replaced all incorrectly predicted genes/proteins in the .faa file, delete the corresponding HMM hit files (two for each .faa) in directory results/hmm, e.g. clusters.faa.hmm.NC_012624.fna.faa and clusters.single.faa.NC_012624.fna.faa.
  5. Re-run ISEScan as you did the last time. ISEScan will skip translating your genome into a proteome and will instead search for transposases and IS elements using the protein sequences and gene positions in the updated .faa file.
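Parts of steps 3 and 4 can be sketched in Python. This is a hedged sketch, not ISEScan code: the helper names are hypothetical, coordinates are assumed to be inclusive with start <= end, and the HMM file name prefixes are copied from the two examples in step 4, so check them against your own results/hmm directory before deleting anything.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches the trailing "_<start>_<end>_<strand>" of a description line,
# e.g. "..._995_2377_+" as in the example from step 3.
_COORDS = re.compile(r"_(\d+)_(\d+)_([+-])$")


def parse_coords(header: str) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, strand) from a description line, or None."""
    match = _COORDS.search(header.strip().lstrip(">"))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def headers_overlapping(headers: Iterable[str],
                        is_start: int, is_end: int) -> List[str]:
    """Headers whose gene interval overlaps the IS interval [is_start, is_end].

    Assumes inclusive coordinates with start <= end on both sides.
    """
    hits = []
    for header in headers:
        coords = parse_coords(header)
        if coords and coords[0] <= is_end and is_start <= coords[1]:
            hits.append(header)
    return hits


def stale_hmm_files(faa_name: str) -> List[str]:
    """File names to delete from results/hmm for one .faa file.

    The two prefixes are taken from the examples in step 4 (an assumption
    that they apply to every .faa file).
    """
    return [f"clusters.faa.hmm.{faa_name}", f"clusters.single.faa.{faa_name}"]
```

For example, `headers_overlapping(all_headers, 2000, 3000)` would list the description lines to review for a suspect IS element predicted at 2000..3000, where `all_headers` is whatever list of '>' lines you collected from the .faa file.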

The files in results/proteome are generated by FragGeneScan. The files in results/hmm are generated by HMMER.

Hope this helps.

Xie

clb21565 commented 2 years ago

Xie, thanks for the detailed solution here. I will try this out.