xiezhq / ISEScan

A python pipeline to identify IS (Insertion Sequence) elements in genome and metagenome
Apache License 2.0

Missing output files for 2 assemblies out of a large dataset #40

Closed biflorenzi closed 2 years ago

biflorenzi commented 3 years ago

Hi @xiezhq, I ran your tool on a large dataset without encountering any issues with the temporary files, thanks for the new version! However, the 9 output files are missing for two assemblies out of the whole dataset. The .list files for these assemblies are still present (not deleted, as they were for the others), and the hmm and proteome files are present as well. It is not a problem at all, since it is only 2 assemblies, but I would like to understand what happened. Do you have any idea what could have caused this? The input assemblies look normal to my eye, and going through the log file did not help me (it is quite long, and it contains a lot of warnings like: Warning: no significant hit with E value etc... ).

Thanks in advance! Biancamaria
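For a large dataset, the inputs that produced no results can be listed with a short script rather than by eye. The following is a minimal sketch, not part of ISEScan: the helper name `find_missing_outputs` is hypothetical, and it assumes that each processed input leaves a result file named `<input file name>.csv` somewhere below the output directory (the pattern seen in `results/input/GCF_900516075.1_18174_7_73_genomic.fna.csv` later in this thread). It checks only that one file, not all 9 outputs.

```python
from pathlib import Path


def find_missing_outputs(seq_files, output_dir, suffix=".csv"):
    """Return the inputs that have no '<input file name><suffix>' below output_dir.

    Assumption: a finished input leaves one result file with this name
    somewhere under the output directory; adjust 'suffix' if needed.
    """
    output_dir = Path(output_dir)
    # Names of all result files found anywhere below the output directory.
    produced = {p.name for p in output_dir.rglob("*" + suffix)}
    # Keep the inputs whose expected result file name was not found.
    return [f for f in seq_files if Path(f).name + suffix not in produced]
```

Calling `find_missing_outputs(list_of_fasta_paths, "results")` returns the inputs that would need a closer look or a rerun.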

xiezhq commented 3 years ago

Hi Biancamaria,

Good luck with your project.

When you say two assemblies, do you mean two sequences in two FASTA files (is each assembly one sequence)? Could you send me two files so that I can look into it?

  1. The list file with only those two assemblies.
  2. The log file.

If the log file is large, could you put the list file and the log file in a cloud space, e.g. Google Drive, from which I can download them?

Xie

biflorenzi commented 3 years ago

Hi @xiezhq, thank you!! I am sorry if I used the term improperly; by 'assembly' I meant 'sample' or 'input file' or 'fasta file'. I have uploaded the files to Google Drive: https://drive.google.com/drive/folders/1yYbZFKRO846VrMkRKgb9KNyF-0Y1uP1O?usp=sharing

Biancamaria

xiezhq commented 3 years ago

I checked the log file for GCF_900516075.1_18174_7_73_genomic.fna, and everything appeared to be going well.

Have you tried running ISEScan on the single genome file, GCF_900516075.1_18174_7_73_genomic.fna? For example,

isescan.py --seqfile GCF_900516075.1_18174_7_73_genomic.fna --output results

I found a bug in ISEScan when I tried GCF_900516075.1_18174_7_73_genomic.fna. After I updated ISEScan, it reported 53 IS copies (both complete and partial copies) in results/input/GCF_900516075.1_18174_7_73_genomic.fna.csv.

You can download the latest pred.py (https://github.com/xiezhq/ISEScan/blob/master/pred.py) into your ISEScan install directory, overwriting the existing file, to update your ISEScan. Then run ISEScan on those two genomes again.
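If several genomes need to be rerun one at a time, the single-file command shown above can be wrapped in a small loop. This is only a sketch under stated assumptions: the helper names `build_isescan_command` and `rerun` are hypothetical, `isescan.py` is assumed to be on the `PATH`, and only the `--seqfile` and `--output` options from the command above are used.

```python
import subprocess


def build_isescan_command(seq_file, output_dir="results"):
    """Build the single-genome command line shown earlier in this thread."""
    return ["isescan.py", "--seqfile", seq_file, "--output", output_dir]


def rerun(seq_files, output_dir="results"):
    """Run ISEScan once per input file and return the inputs whose run failed.

    Assumption: a non-zero exit status means the run failed; a run that
    exits cleanly but writes no output would not be caught here.
    """
    failed = []
    for seq_file in seq_files:
        # Each genome gets its own process, so one failure does not stop the rest.
        result = subprocess.run(build_isescan_command(seq_file, output_dir))
        if result.returncode != 0:
            failed.append(seq_file)
    return failed
```

For example, `rerun(["GCF_900516075.1_18174_7_73_genomic.fna"])` reruns the genome discussed here; the second affected file can be added to the list in the same way.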

xiezhq commented 3 years ago

Hi Biancamaria, I believe the issue has been solved in the latest ISEScan version, v1.7.2.3.

biflorenzi commented 3 years ago

Hi Xie, thank you for taking the time to look into it! I had indeed tried to run the tool on the single files last week, again without getting the expected outputs. I am glad you managed to find and solve a bug in pred.py; as I mentioned, it was not a big problem for me to be missing only two samples' outputs out of my whole dataset, but I thought it could be useful to report this problem! Anyhow, I will now try out the new version and hopefully be able to include those samples in my project.

Thank you again for your replies and dedication, it is much appreciated.

Biancamaria