zstephens / telogator2

Allele-specific telomere analysis from long reads
MIT License
9 stars 1 forks source link

Multiple input file error #2

Open niyati1211 opened 1 month ago

niyati1211 commented 1 month ago

I ran telogator2 on ten samples. The program ran for over 30 minutes and then gave me the following error:

Error: aligning subtels to reference...Error: alignment command returned an error: -9

Command: python telogator2.py -i lrwgs1.fa.gz lrwgs2.fa.gz lrwgs3.fa.gz lrwgs4.p_ctg.fa.gz lrwgs5.fa.gz lrwgs6.fa.gz 1004266.p_ctg.fa.gz lrwgs7.fa.gz lrwgs8.fa.gz lrwgs9.fa.gz lrwgs10.fa.gz -o results/ --muscle /home/jupyter/miniconda3/envs/telogator2/bin/muscle --winnowmap /home/jupyter/miniconda3/envs/telogator2/bin/winnowmap -r hifi -tt 0.250 -ts 0.250 -n 2

I don't receive the error when I run each sample individually. My goal is to run telogator2 for ~1000 samples, so what is the best approach to analyze all the samples? Is it possible to input a text file with the list of filenames?

zstephens commented 1 month ago

Greetings! Since the subtel alignment step failed, it should've gotten to the point where the results/temp/unanchored_subtels.fa.gz intermediary file was generated. Can you try running:

winnowmap -ax map-hifi -Y -o out.sam -W telogator2/resources/telogator-ref-k15.txt telogator2/resources/telogator-ref.fa.gz results/temp/unanchored_subtels.fa.gz

(with paths adjusted to match where the files are on your system)

That's essentially the command that's being run from within Telogator2 at the subtel alignment step, and if that command is failing then hopefully running it on its own here should provide a more detailed error message.

As for analyzing a large number of samples, are you running them as submitted jobs on an HPC system, or do you intend to process everything locally? My advice would depend on how many compute resources you're working with, as it would take quite awhile to run that many samples back to back sequentially.

niyati1211 commented 1 month ago

Thank you for the quick response! I will try running the winnowmap command. I am using a cloud environment (dsub and cromwell/nextflow are available).

Aditionally, a quick clarification since I may have been interpreting the multiple inputs incorrectly. So I have 1000 samples (different individuals) and for each individual, I have 3 fasta files: hiFi primary assembly, hifi haplotype assembly resolved for haplotype I, and hifi haplotype assembly resolved for haplotype II. The approach I am currently thinking of using is to include all three files as input for every individual (this seems to provide better and more estimates of ATL) and then run telogator2 separately for every individual.

zstephens commented 1 month ago

Oh, interesting. So you're not analyzing long read data, but rather entire assembled contigs (from hifiasm or similar program, I imagine?). In that case I would recommend running telogator2 with -n 1, as you're interested in finding telomeres at the end of large, singular contigs.

I never tried analyzing such data, but if there are links anywhere to publicly available assemblies comparable to what you're running, I'd be interested to download it and see how it works.

niyati1211 commented 1 month ago

Ohh that makes sense, yes, this is assembled contigs from hifiasm. Though, when I try with -n 1, I get the following error:

refining clusters [TVR]...Traceback (most recent call last):
  File "/home/jupyter/telogator2/telogator2.py", line 1068, in <module>
    main()
  File "/home/jupyter/telogator2/telogator2.py", line 582, in main
    subset_clustdat = cluster_tvrs(khd_subset, KMER_METADATA, fake_chr, fake_pos, TREECUT_REFINE_TVR, TREECUT_PREFIXMERGE,
  File "/home/jupyter/telogator2/source/tg_tvr.py", line 429, in cluster_tvrs
    Zread      = linkage(dist_array, method='ward')
  File "/home/jupyter/miniconda3/envs/telogator2/lib/python3.9/site-packages/scipy/cluster/hierarchy.py", line 1033, in linkage
    n = int(distance.num_obs_y(y))
  File "/home/jupyter/miniconda3/envs/telogator2/lib/python3.9/site-packages/scipy/spatial/distance.py", line 2605, in num_obs_y
    raise ValueError("The number of observations cannot be determined on "
ValueError: The number of observations cannot be determined on an empty distance matrix.

Command:

python /home/jupyter/telogator2/telogator2.py -i read.haplotype1.fa.gz read.primary.fa.gz read.haplotype2.fa.gz -o results/ --muscle /home/jupyter/miniconda3/envs/telogator2/bin/muscle --winnowmap /home/jupyter/miniconda3/envs/telogator2/bin/winnowmap -r hifi -tt 0.250 -ts 0.250 -n 1
zstephens commented 1 month ago

Ahah. It's been awhile since I've tested edge cases like this (several sections of the code have implicit assumptions that multiple reads will be present). I'll go through and add checks for so that steps can proceed even with a single read per allele. I'll try to get this updated ASAP.

zstephens commented 1 month ago

That was easier than anticipated. I pushed some fixes to commit 696132774a0d288c995ef3e892b3d53924acd632, could you try pulling the latest version and rerunning?

niyati1211 commented 1 month ago

It works now! Thank you so much!!!

My understanding of telogator2 is that it first extracts reads with canonical telomere repeats, then estimates the telomere/subtelomere boundaries based on the densities of repeats in a sliding window, selects reads that span telomere and non-telomere regions. The alleles are identified based on the TVR region; clusters read based on TVR and then aligns to reference to see which ones anchor to subtelomere? Since I am working with assembled contigs, is it correct to think that the clustering step doesn't happen per se? It just estimates telomeres based on a single “read”.

I will update you on my results when I run on all samples!

zstephens commented 1 month ago

Indeed, most of the important clustering will still function (partitioning your 'reads' into individual alleles based on the uniqueness of their subtelomere and TVR regions), but certain cluster refinement steps will all be skipped, e.g. multiple sequence alignment for TVR sequences. I don't know hifiasm well enough to speculate whether the telomere length in its assembled contigs include all the telomere sequence found in its constituent reads, but if you happen to have the raw read data laying around it could be interesting to run those and see how it compares.

niyati1211 commented 1 month ago

That makes sense! Thank you so much!