Closed jakob-wirbel closed 3 years ago
Which characters are causing problem?
-
characters from the alignment (i had forgotten to remove them at first)
Can you copy-paste the error that you get?
will have to re-create this, but hope i can do it this afternoon at some point
check_input
works fine:
stag check_input -i <input.fasta> -x <tax_file.tax? -a ,hmm_file.hmm.
------ CHECK TAXONOMY FILE:
Detected 7 taxonomic levels
Check number of taxonomy levels.......................correct
Check if the names are unique across levels...........correct
Check if there are multiple parents...................correct
Found 38161 genes (lines)
------ CHECK FASTA FILE:
Check that the sequences are in fasta format..........correct
Number of genes: 38161
Number of unique genes: 12450
------ CHECK CORRESPONDENCES:
Check correspondences of gene ids to the tax ids......correct
Check taxonomy of genes with same sequence............
WARNING: Some genes have same sequence, but different taxonomy.
------ CHECK TOOL:
Check that 'hmmalign' is in the path..................correct
Check that 'esl-reformat' is in the path..............correct
Try to run alignment tool.............................correct
Check alignment quality:
Internal states: 193
Sequence 1:
Internal states matches: 1 (1%)
Deletions: 192 (99%)
Insertions: 203
Sequence 2:
Internal states matches: 1 (1%)
Deletions: 192 (99%)
Insertions: 203
Sequence 3:
Internal states matches: 1 (1%)
Deletions: 192 (99%)
Insertions: 203
But then the train function has some problems:
stag train -i <input.fasta> -x <tax_file.tax> -a <hmm_file.hmm> -o <output.stagDB>
Parse failed (sequence file input.fasta):
Line 762: illegal character -
Alignment input open failed.
can't guess alignment input format: empty file/no data
while reading from an input stream (not a file)
[E::align] Error. hmmalig/cmalign failed
the problem was that the fasta file was actually not a fasta but the alignment:
head input.fasta
>d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcusaureus_0
............................................................
..GU.C..C.A.G....A.C.U.CC...UA..C.G.........................
............................................................
..............................................G...G.A..G...G
.C.A...G.C..A......G.U.A...G.....G.....G...A.A..U...C.U.UC.C
.G...C.A..A.U..G.G.G...C....G..A.A.A........................
............................................................
............................................................
..............................................GC.C.U.GA..C..
so, in summary, i guess that the check_input
function could test for this already :)
The alignment is done also in the check_input
routine, but only on the first 3 genes. So I guess that the -
is not in the first three sequences.
I'll add it to the check_input
if it doesn't take too long to run! (Probably not)
The idea is that the check_input
should run fast, and find possible issues, before running the heavy train
.
Also,
Sequence 1:
Internal states matches: 1 (1%)
Deletions: 192 (99%)
Insertions: 203
is a little bit worrisome. I would expect to have at least 70% match to the Internal states matches
. Because these are actually the features that are used to train the classifiers. I usually see ~95%.
After running
check_input
, the program did not raise an error. However, when i then wanted to runstag train
, there we characters that were not allowed in the fasta file. It should be possible to test for this in thecheck_input
function already, i guess.