xiezhq / ISEScan

A python pipeline to identify IS (Insertion Sequence) elements in genome and metagenome
Apache License 2.0

Is fasta file with predicted gene sequence OK? #12

Closed Xiaojun928 closed 5 years ago

Xiaojun928 commented 5 years ago

Hi Zhiqun,

Thanks for developing such useful software. It seems that ISEScan requires genomic sequences as input. However, some of the predicted IS regions cover only part of a gene that I predicted with prokka (exp1), and some of them cover more than two genes predicted by prokka (exp2). Here are the corresponding examples:

exp1: [image: predicted IS seqs] [image: gene position predicted by prokka]

exp2: [image: predicted IS seqs] [image: gene positions predicted by prokka]

So I tried using CDS sequences as the input for ISEScan, and the results match the genes predicted by prokka well. Can I use CDS sequences directly when running ISEScan?

Your suggestion is really appreciated!

Best, Xiaojun

xiezhq commented 5 years ago

Hi Xiaojun,

ISEScan treats each sequence in a fasta file (the genome or metagenome sequence file required by ISEScan) as an independent 'genome' sequence. For example, if you put three sequences in a fasta file, ISEScan will identify insertion sequence elements in each of the three independent 'genome' sequences.
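To make the "one record, one 'genome'" point concrete, here is a minimal sketch that splits a fasta file into its records, which is the unit described above. It is not ISEScan's own code; the function name `read_fasta` and the example sequences are hypothetical and only illustrate how three records would be seen as three independent inputs.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) for every record in a fasta stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # The sequence ID is the first word after '>'.
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical three-record fasta file, held in memory for the example.
example = io.StringIO(
    ">contig_1 made-up description\nACGT\nACGT\n"
    ">contig_2\nGGCC\n"
    ">contig_3\nTTAA\n"
)
records = list(read_fasta(example))
for seq_id, seq in records:
    print(seq_id, len(seq))
```

Each printed line corresponds to one sequence that would be searched on its own, so an IS element split across two records (for example, across two CDS entries) cannot be reassembled.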

So, technically, you can try putting only CDS sequences in a fasta file but you should pay attention to some possible issues:

  1. ISEScan will only report partial IS elements without TIRs, because the TIR sequences are usually located outside the CDS of the transposase gene. Please use ISEScan-1.7 or later, which reports partial IS elements by default.

  2. ISEScan uses FragGeneScan to predict genes and translate the gene sequences into peptide sequences. When the fasta file contains only gene sequences, the genes that FragGeneScan predicts from them may in some cases differ from the genes predicted by prokka. You may therefore miss some expected IS elements (partial IS elements, in the case of a CDS-only fasta file) in the result file output by ISEScan.

  3. Some IS elements include one transposase gene and one accessory (non-transposase) gene. In this case, ISEScan might not be able to identify the full IS element, because each CDS covers only one gene.

  4. In case of a frameshifted translation, ISEScan might report the same transposase as several different partial IS elements.
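Points 1 and 3 can be illustrated with a small interval comparison. The sketch below is not part of ISEScan or prokka; the names `Interval`, `overlap_len`, and `classify_genes`, and all coordinates, are made up for illustration. It assumes 1-based inclusive coordinates on the same contig and shows how a full IS element can extend beyond the transposase CDS (where TIRs would sit) and cover more than one gene.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def classify_genes(is_element: Interval, genes: List[Interval]) -> Dict[str, str]:
    """Label each overlapping gene as fully or partially covered by the IS."""
    labels = {}
    for gene in genes:
        shared = overlap_len(is_element, gene)
        if shared == 0:
            continue  # gene lies outside the IS element
        gene_len = gene.end - gene.start + 1
        labels[gene.name] = "full" if shared == gene_len else "partial"
    return labels


# Hypothetical coordinates, not taken from any real genome.
is_element = Interval("IS_1", 1000, 2500)
genes = [
    Interval("transposase", 1050, 2100),
    Interval("accessory", 2150, 2450),
    Interval("neighbour", 2480, 3000),
    Interval("unrelated", 3100, 3500),
]

labels = classify_genes(is_element, genes)
# Bases of the IS element upstream of the transposase CDS; a TIR in this
# flank would be lost if only the CDS were given as input.
left_flank = genes[0].start - is_element.start
print(labels, left_flank)
```

In this made-up layout the IS element fully covers two genes and clips a third, which mirrors the exp1 and exp2 observations: IS boundaries are not expected to coincide with gene boundaries.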

Hope it helps.

Xie

Xiaojun928 commented 5 years ago

Thanks a lot for your help! Now I think it's better to use assembled contigs as input.

xiezhq commented 5 years ago

Yes, you are right. The best option for you is probably to put the assembled contigs in a fasta file and use that as the input of ISEScan. Most of the available draft bacterial genome sequences (and assembled metagenome contigs) are simply contigs in a fasta file, which ISEScan handles very well.