zstephens / telogator

A method for measuring chromosome-specific telomere length from long reads
GNU General Public License v3.0
20 stars 1 forks source link

Some value are weird in output results.tsv #2

Closed kuanchiun closed 2 years ago

kuanchiun commented 2 years ago

Hello! I got a little confused about the output file.

I used the HG002 Nanopore data and run telogator with the command in the README.md, and change the pbmm2 into minimap2.

minimap2 -ax map-ont -t 16 -N 5 -Y -L t2t-and-subtel.fa HG002_nanopore_trimmed.fastq > HG002_telogator_new.sam samtools view -bSh HG002_telogator_new.sam > HG002_telogator_new.bam samtools sort -O bam -o HG002_telogator_new_sorted.bam HG002_telogator_new.bam samtools index HG002_telogator_new_sorted.bam python ../../../../software/telogator/grab_reads_from_subtel_aln.py --bam HG002_telogator_new_sorted.bam --fa HG002_nanopore_trimmed.fasta --out new-subtel-reads.fa.gz --bed ../../../../software/telogator/resources/subtel_regions.bed minimap2 -ax map-ont -t 24 -N 5 -Y -L ../../../../software/telogator/resources/t2t-telogator-ref.fa new-subtel-reads.fa.gz > new_subtel_aln.sam samtools view -bSh new_subtel_aln.sam > new_subtel_aln.bam samtools sort -O bam -o new_subtel_aln_sorted.bam new_subtel_aln.bam samtools index new_subtel_aln_sorted.bam samtools view new_subtel_aln_sorted.bam | python ../../../../software/telogator/telogator.py -i - -o new_telogator_out/ -k ../../../../software/telogator/resources/nanopore_kmers.tsv python ../../../../software/telogator/merge_jobs.py -i new_telogator_out/

After analysis, I got the output file like this. 擷取1

And when I open the results.tsv, I found that the value in tel_lens and read_lens are weird, seems like very large number. 擷取3

Is there any error during my step? And how to explain that weird large number in tel_lens and read_lens?

Thank you very much!

zstephens commented 2 years ago

Greetings, the commands you've shown above appear correct at first glance. Is the HG002 Nanopore data you used publicly available? If so I can try to replicate the commands to see if I arrive at the same corrupted result.

kuanchiun commented 2 years ago

Thank you for reply.

The HG002 Nanopore data what I use is NCBI SRA run ID SRR12898351 which mentioned in Telogator research paper, I used prefetch and fasterq-dump to convert it into the fastq file, and trimmed the adapter by porechop.

The Telogator research paper mentioned that from the Nanopore data, Telogator reports TL for 39/46 chromosome arms, with 6p, 8p, 9p, 15p, 17p, 19q and 22p absent.

But after the analysis following by the command, I can see there are some reads located in 9p, 15p,17p and 22p in tlens_violin_03_all-pass.png. Probably there are some difference during the analysis step?

Thanks again!

tlens_violin_03_all-pass

zstephens commented 2 years ago

Ahh ok. For the paper I had processed hg002 nanopore data under run ID SRR13062494. I have begun downloading the data from run ID SRR12898351 and will follow up once the workflow completes.

Another possible explanation for why you are observing chromosomes that were not reported in the paper is that since then we have made some improvements to Telogator, including the addition of the nanopore_kmers.tsv file, as suggested by #1 .

kuanchiun commented 2 years ago

Got it, thank you a lot!

Maybe the difference of result is because of the different sample used in process?

By the way, does Telogator spend lots of time to do analysis? Last time I cost a whole weekend to complete telogator.py. There seems to be no progress bar during the analysis, can you consider adding a progress bar?

zstephens commented 2 years ago

There are a few ways to speed up the workflow, specifically grab_reads_from_subtel_aln.py runs a bit faster if your fasta does not have newlines in the read sequence (e.g. using ./fastq-dump --fasta 0 to get the reads from SRA, and using the --newlines no input option). telogator.py can be accelerated by running multiple invocations in parallel and merging the results, via the --job input option. Also, you can tell telogator to spit out progress as it processes each read using the --debug input option.

I acknowledge that this is a bit sloppy at the moment. For the next update I plan to switch over to using multiprocessing, so that telogator can be parallelized without having to run multiple commands. I'll also add some progress indications.

(I downloaded the nanopore data and have started the workflow, I'll follow up on the results soon)

zstephens commented 2 years ago

Greetings, I was able to successfully process this data using the following commands (filepaths simplified for readability):

(0) download reads (with no wrapping)

./fastq-dump --fasta 0 --stdout SRR12898351 | gzip > SRR12898351.fa.gz

(1) initial alignment

minimap2 -ax map-ont -t 6 -N 5 -Y -L t2t-and-subtel.fa SRR12898351.fa.gz | samtools view -bhS - > SRR12898351.bam samtools sort -@ 3 -T SRR12898351.temp -o SRR12898351-sort.bam SRR12898351.bam samtools index SRR12898351-sort.bam

(2) extract reads aligned to subtelomere regions and align them to telogator ref

python3 grab_reads_from_subtel_aln.py --fa SRR12898351.fa.gz --bam SRR12898351-sort.bam --bed subtel_regions.bed --out SRR12898351_subtel-reads.fa.gz --newlines no --readtype SRA minimap2 -ax map-ont -t 6 -N 5 -Y -L t2t-telogator-ref.fa SRR12898351_subtel-reads.fa.gz | samtools view -bhS - > subtel-aln.bam samtools sort -@ 3 -T subtel-aln.temp -o subtel-aln-sort.bam subtel-aln.bam samtools index subtel-aln-sort.bam

Here are the files produced thus far:

-rw-r--r-- 1 zach staff 24G Mar 5 23:44 SRR12898351-sort.bam -rw-r--r-- 1 zach staff 29M Mar 6 11:21 SRR12898351-sort.bam.bai -rw-r--r-- 1 zach staff 14G Mar 5 11:02 SRR12898351.fa.gz -rw-r--r-- 1 zach staff 129M Mar 6 15:56 SRR12898351_subtel-reads.fa.gz -rw-r--r-- 1 zach staff 568M Mar 6 16:46 subtel-aln-sort.bam -rw-r--r-- 1 zach staff 605K Mar 6 16:46 subtel-aln-sort.bam.bai

(3) run telogator

samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 1 8 samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 2 8 samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 3 8 samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 4 8 samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 5 8 samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 6 8 samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 7 8 samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 8 8

(running in parallel)

python3 merge_jobs.py -i /Volumes/Epoch99/BIG_RELOCATED_DATA_2/hg002_nanopore_promethion/telogator/

The run seems to have completed successfully:

tlens_violin_03_all-pass

Here is results.tsv:

#subtel position tel_len_p90 tel_lens read_lens chr1_p 3083 4551 2981,3316,3437,3991,4231,4249,4427,4842 16380,8261,20587,16677,23859,17295,21735,40114 chr1_q 496458 6932 381,3018,5235,5948,6209,7656 8971,27067,20007,11078,21399,16958 chr3_p 2817 10089 4657,5767,6008,6043,6355,6528,7105,7193,10077,10206 7400,14591,13462,7449,11805,23025,8681,13445,51249,37469 chr3_q 298047 1582 1144,1631 15277,6648 ...

Which look fine.

The main thing that jumps out to me is the --readtype SRA parameter for grab_reads_from_subtel_aln.py. I think for this data it is crucial to indicate that the reads following the SRA naming convention, otherwise I'm not sure what your subtel-reads fasta would contain. I think I could improve the documentation to point this out.

kuanchiun commented 2 years ago

Thank you for testing, the figure result looks like which we analysis, so we can confirm that the difference between the result on the paper and the result what we analysis is due to difference of SRA sample.

Our read name which generated by fasterq-dump looks like SRR12898351.sra.8432143, so the weird value in tsv file maybe caused by the different type of read name? Sounds cool.

And how should I run telogator parallelly? By using nohup? or just open multiple terminal and write the command?

Thanks again!

zstephens commented 2 years ago

To run them parallel in a single terminal you could use something like:

python3 telogator.py --job 1 2 & python3 telogator.py --job 2 2 & wait

To make it a bit more user-friendly I hope to incorporate multiprocessing into telogator soon, so that it could be run with a single command and the user could specify how much to parallelize it.

kuanchiun commented 2 years ago

This sounds great and I'll keep watching. Thanks again for your detailed reply.