xiezhq / ISEScan

A python pipeline to identify IS (Insertion Sequence) elements in genome and metagenome
Apache License 2.0

IS subgroups #42

Closed by SamuelGreenrod 2 years ago

SamuelGreenrod commented 2 years ago

Hi there,

How are the IS subgroups determined? I've run ISEScan on some genomes and BLASTed representative ISs from multiple subgroups against ISFinder, and found that they share the same top hit (E-value < 10^-4). This suggests that different subgroups may actually contain the same sequence, so I'm confused about how they have been separated.

Are the subgroups decided based on the transposase? The source code mentions that they are determined by CD-Hit clustering, but it doesn't say whether the clustering is done on the nucleotide sequence of the whole IS element or only on the transposase amino acid sequence. Thanks.

xiezhq commented 2 years ago

Hi Samuel,

Thanks for your interest in ISEScan.

Yes, the subgroups are determined by clustering the transposases. CD-Hit was used to cluster the amino acid sequences of the transposases, because homology searches on amino acid sequences are more sensitive and reliable than homology searches on nucleotide sequences.
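
To illustrate the idea, here is a minimal toy sketch of greedy, length-sorted clustering of amino acid sequences, the general style of clustering CD-Hit performs. It is not ISEScan's code and not CD-Hit's actual algorithm: the `identity` and `greedy_cluster` functions, the 0.9 threshold, and the three toy sequences are all assumptions made up for this example, and `difflib.SequenceMatcher` stands in for a real sequence alignment.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Approximate identity: matched residues / length of the shorter sequence.

    SequenceMatcher is only a stand-in for a proper alignment (assumption).
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(proteins: Dict[str, str], threshold: float = 0.9) -> List[List[str]]:
    """Greedy incremental clustering; the first member of each cluster is its representative."""
    clusters: List[List[str]] = []
    # Process the longest sequences first so that they become representatives.
    for name in sorted(proteins, key=lambda n: len(proteins[n]), reverse=True):
        for cluster in clusters:
            representative = proteins[cluster[0]]
            if identity(representative, proteins[name]) >= threshold:
                cluster.append(name)  # similar enough: join this cluster
                break
        else:
            clusters.append([name])  # no match: start a new cluster


    return clusters


# Hypothetical transposase fragments, invented for illustration only.
toy_transposases = {
    "tnp_a": "MKRTLSEEQKAIVLELYRSGMSVREIA",
    "tnp_b": "MKRTLSEEQKAIVLELYRSGMSVREI",   # tnp_a minus its last residue
    "tnp_c": "MWPHDNCFWPHDNCFWPHDNCFWPHDN",  # unrelated sequence
}

clusters = greedy_cluster(toy_transposases, threshold=0.9)
for members in clusters:
    print(members)
```

With these toy inputs, `tnp_a` and `tnp_b` fall into one cluster and `tnp_c` into its own. Note that the grouping depends only on the protein sequences and the chosen threshold, not on the nucleotide sequence of the whole element.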

Zhiqun Xie

SamuelGreenrod commented 2 years ago

Great, thank you!