zstephens / telogator2

A method for measuring allele-specific telomere length (TL) and characterizing telomere variant repeat (TVR) sequences from long reads.
MIT License

muscle error #5

Closed YonkoBigMom closed 2 months ago

YonkoBigMom commented 3 months ago

Hi,

I ran

python telogator2/telogator2.py -i my_file.reads.bam -o my_file.telogator2/ -p 5 --muscle ./muscle --minimap2 tools/minimap2-2.28_x64-linux/minimap2 -r ont

The code failed at:

initial clustering of all reads...Process Process-10:
Traceback (most recent call last):
  File "/home/tiger/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/tiger/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tiger/telogator2/source/tg_tvr.py", line 189, in parallel_msa_job
    [msa_seq, consensus_seq] = get_muscle_msa(clust_seq, muscle_exe, tempfile_prefix=my_prefix, char_score_adj=char_score_adj, noncanon_cheat=noncanon_cheat)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/telogator2/source/tg_muscle.py", line 76, in get_muscle_msa
    read_dat = my_reader.get_next_read()
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/telogator2/source/tg_reader.py", line 82, in get_next_read
    my_dat = self.f.readline().strip()
             ^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 5261: invalid start byte
Traceback (most recent call last):
  File "/home/tiger/telogator2/telogator2.py", line 1089, in <module>
    main()
  File "/home/tiger/telogator2/telogator2.py", line 483, in main
    read_clust_dat = cluster_tvrs(kmer_hit_dat, KMER_METADATA, fake_chr, fake_pos, TREECUT_INITIAL, TREECUT_PREFIXMERGE,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/telogator2/source/tg_tvr.py", line 593, in cluster_tvrs
    tel_boundary = find_cumulative_boundary(denoised_consensus,tvr_letters,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/telogator2/source/tg_tvr.py", line 848, in find_cumulative_boundary
    if hits_cum[-1] < min_hits: # not enough hits to even bother trying
IndexError: index -1 is out of bounds for axis 0 with size 0

zstephens commented 3 months ago

Greetings! Indeed, it looks like the muscle step might be failing to produce an intermediary fasta file that the rest of the code is expecting. This might require looking at muscle's logs (which Telogator2 currently deletes immediately after running it). If you are willing to share the temp/tel_reads.fa.gz file that should have been produced, I'd be happy to try running it myself to debug further.
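
For anyone hitting the same UnicodeDecodeError, it can help to find the exact offending byte in whichever intermediate file the reader was parsing. The following is a minimal sketch and not part of Telogator2: `find_first_bad_byte` is a hypothetical helper, and the path passed to it is whichever file you want to inspect.

```python
def find_first_bad_byte(path):
    """Return (offset, byte_value) of the first byte that is not valid UTF-8.

    Returns None when the whole file decodes cleanly.
    Hypothetical diagnostic helper; not a Telogator2 function.
    """
    # Read raw bytes so that opening the file cannot itself raise a decode error.
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset where decoding failed.
        return err.start, data[err.start]
    return None
```

A result such as an offset pointing at a 0xc0 byte would suggest the file holds binary or truncated output rather than a text alignment, which is consistent with muscle failing before it finished writing.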

YonkoBigMom commented 3 months ago

Sure,

Here is the file https://drive.google.com/file/d/1prDqOzTMq6ICuKfzP_YH6sUi9PfGT6Wz/view?usp=sharing

zstephens commented 3 months ago

I was able to process that data without any errors (results attached), so unfortunately I'm not any closer to resolving the issue you ran into.

my_file_telogator2_output.zip

Are you able to run muscle on its own? I attached some intermediary files from the run, and you could check if the following command works without error:

muscle -in example_tvrs.fa -out msa.fa -seqtype protein -gapopen -12.0 -gapextend -4.0 -center 0.0 -matrix scoring_matrix.txt

scoring_matrix.txt example_tvrs.fa.zip
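
Since Telogator2 deletes muscle's logs right after the run, one way to see what muscle actually reports is to run the command above from a small wrapper that keeps the exit code and stderr. This is only a sketch: `build_muscle_cmd` and `run_and_capture` are hypothetical helper names, the flags are copied verbatim from the command above, and the default file names assume the attached files sit in the current directory.

```python
import subprocess


def build_muscle_cmd(muscle_exe="muscle", in_fa="example_tvrs.fa",
                     out_fa="msa.fa", matrix="scoring_matrix.txt"):
    """Build the argument list for the standalone muscle check."""
    return [
        muscle_exe,
        "-in", in_fa,
        "-out", out_fa,
        "-seqtype", "protein",
        "-gapopen", "-12.0",
        "-gapextend", "-4.0",
        "-center", "0.0",
        "-matrix", matrix,
    ]


def run_and_capture(cmd):
    """Run a command and return (exit_code, stderr_text).

    errors="replace" keeps stray non-UTF-8 bytes in the log from
    raising a second UnicodeDecodeError while we are debugging the first.
    Raises FileNotFoundError if the executable cannot be found.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, errors="replace")
    return result.returncode, result.stderr


if __name__ == "__main__":
    code, log = run_and_capture(build_muscle_cmd("./muscle"))
    print("exit code:", code)
    print(log)
```

A non-zero exit code or an error message in the captured log would point at the muscle binary itself (for example, a version that does not accept these options) rather than at the input data.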

zstephens commented 2 months ago

Closing this for now. I recently updated Telogator2 so that it now does its own MSA, which removes muscle as a dependency. Feel free to reopen if there are further issues.