xiezhq / ISEScan

A python pipeline to identify IS (Insertion Sequence) elements in genome and metagenome
Apache License 2.0

Input sequence that contains multiple interspersed unknown bases #23

Closed raven44099 closed 2 years ago

raven44099 commented 4 years ago

This tool is great, and the instructions (README.md) are very clear.

I exported my sequence as a FASTA file, which gave a sequence with multiple gaps (like this: CGTATA????ACACGCCCGTTGTTTT??????????????????GACACTCACGGCGTCCAGTCCGCTTATCGGTGTCTATGCCCCTACAGGCGCTACTCTGACGGCAACGCTAACCTCTGCAAATGGCACTCCAGTGGAGGGTCAGGTCATCAACTTTAGCGTAACGC?????GGG).

Then I deleted the "?" characters (each representing an unknown base) and ran ISEScan. It worked, and the prediction directory contained all 7 files. ISEScan found 61 transposons in total. The problem is that the coordinates in the .fasta.gff file now refer to different positions than in the original "?"-containing sequence, so I would have to reassign all 61 transposons manually, which I would like to avoid. I therefore tried running the full sequence including the "?" characters, but this did not work: no "prediction" directory was created (although the hmm and proteome directories and _SN1_exportFirst.fasta.list were created).

Do you know a simple solution to this problem? Thank you in advance.

xiezhq commented 4 years ago

Hi Raven44099,

You can try replacing '?' in your sequences with 'N', and run ISEScan on the updated sequences. If the generated result files still contain incorrect reference positions or other problems, please let me know.
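A minimal sketch of that replacement in Python, in case it helps. The function name `mask_unknown_bases` and the example record are made up for illustration and are not part of ISEScan. It only touches sequence lines, so a "?" in a header line is left alone, and because each "?" becomes exactly one "N", sequence length and coordinates stay unchanged.

```python
def mask_unknown_bases(fasta_text: str, unknown: str = "?", replacement: str = "N") -> str:
    """Replace every `unknown` character in sequence lines with `replacement`.

    Header lines (starting with '>') are copied unchanged.
    Hypothetical helper for illustration; not an ISEScan function.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # keep the record header as-is
        else:
            out.append(line.replace(unknown, replacement))
    return "\n".join(out) + "\n"


# Small made-up record to show the effect.
example = ">SN1\nCGTATA????ACACGCC\nGTTTT??GACAC\n"
masked = mask_unknown_bases(example)
print(masked)
```

For a real file, read its contents into a string, pass it through the function, and write the result to a new FASTA file before running ISEScan on that file.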

Thanks, Xie

raven44099 commented 4 years ago

Thank you so much, this makes it a lot easier!
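For results already produced from the "?"-stripped sequence, the coordinates could in principle be mapped back to the original instead of rerunning. The sketch below assumes 1-based, inclusive start/end positions (as used in GFF) and a single sequence held as one string; the helper names `build_position_map` and `to_original` and the short example sequence are hypothetical.

```python
def build_position_map(original_seq: str, unknown: str = "?") -> list[int]:
    """Return, for each base kept after stripping, its 1-based position in the original."""
    return [i + 1 for i, base in enumerate(original_seq) if base != unknown]


def to_original(start: int, end: int, position_map: list[int]) -> tuple[int, int]:
    """Map a 1-based inclusive interval on the stripped sequence to the original sequence."""
    return position_map[start - 1], position_map[end - 1]


# Made-up example: positions 7-10 and 15-16 of the original are unknown.
original = "CGTATA????ACAC??GG"
stripped = original.replace("?", "")  # "CGTATAACACGG"
pmap = build_position_map(original)

# A feature at 7..10 on the stripped sequence ("ACAC") sits at 11..14 in the original.
print(to_original(7, 10, pmap))
```

A feature that spans a removed gap comes out longer in original coordinates than in the stripped ones, since the gap positions fall inside the mapped interval.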