stag check_input does not check for allowed characters

jakob-wirbel commented 4 years ago

After running check_input, the program did not raise an error. However, when i then wanted to run stag train, there we characters that were not allowed in the fasta file. It should be possible to test for this in the check_input function already, i guess.

AlessioMilanese commented 4 years ago

Which characters are causing problem?

jakob-wirbel commented 4 years ago

- characters from the alignment (i had forgotten to remove them at first)

AlessioMilanese commented 4 years ago

Can you copy-paste the error that you get?

jakob-wirbel commented 4 years ago

will have to re-create this, but hope i can do it this afternoon at some point

jakob-wirbel commented 4 years ago

check_input works fine:

stag check_input -i <input.fasta> -x <tax_file.tax? -a ,hmm_file.hmm.
------ CHECK TAXONOMY FILE:
Detected 7 taxonomic levels
Check number of taxonomy levels.......................correct
Check if the names are unique across levels...........correct
Check if there are multiple parents...................correct
Found 38161 genes (lines)
------ CHECK FASTA FILE:
Check that the sequences are in fasta format..........correct
Number of genes: 38161
Number of unique genes: 12450
------ CHECK CORRESPONDENCES:
Check correspondences of gene ids to the tax ids......correct
Check taxonomy of genes with same sequence............

 WARNING: Some genes have same sequence, but different taxonomy.
------ CHECK TOOL:
Check that 'hmmalign' is in the path..................correct
Check that 'esl-reformat' is in the path..............correct
Try to run alignment tool.............................correct

Check alignment quality:
 Internal states: 193

 Sequence 1:
   Internal states matches: 1 (1%)
   Deletions: 192 (99%)
   Insertions: 203

 Sequence 2:
   Internal states matches: 1 (1%)
   Deletions: 192 (99%)
   Insertions: 203

 Sequence 3:
   Internal states matches: 1 (1%)
   Deletions: 192 (99%)
   Insertions: 203

But then the train function has some problems:

stag train -i <input.fasta> -x <tax_file.tax> -a <hmm_file.hmm> -o <output.stagDB>
Parse failed (sequence file input.fasta):
Line 762: illegal character -

Alignment input open failed.
   can't guess alignment input format: empty file/no data
   while reading from an input stream (not a file)
[E::align] Error. hmmalig/cmalign failed

the problem was that the fasta file was actually not a fasta but the alignment:

 head input.fasta 
>d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcusaureus_0
............................................................
..GU.C..C.A.G....A.C.U.CC...UA..C.G.........................
............................................................
..............................................G...G.A..G...G
.C.A...G.C..A......G.U.A...G.....G.....G...A.A..U...C.U.UC.C
.G...C.A..A.U..G.G.G...C....G..A.A.A........................
............................................................
............................................................
..............................................GC.C.U.GA..C..

jakob-wirbel commented 4 years ago

so, in summary, i guess that the check_input function could test for this already :)

AlessioMilanese commented 4 years ago

The alignment is done also in the check_input routine, but only on the first 3 genes. So I guess that the - is not in the first three sequences.

I'll add it to the check_input if it doesn't take too long to run! (Probably not) The idea is that the check_input should run fast, and find possible issues, before running the heavy train.

Also,

 Sequence 1:
   Internal states matches: 1 (1%)
   Deletions: 192 (99%)
   Insertions: 203

is a little bit worrisome. I would expect to have at least 70% match to the Internal states matches. Because these are actually the features that are used to train the classifiers. I usually see ~95%.

zellerlab / stag

stag check_input does not check for allowed characters #12