Closed kuanchiun closed 2 years ago
Greetings, the commands you've shown above appear correct at first glance. Is the HG002 Nanopore data you used publicly available? If so I can try to replicate the commands to see if I arrive at the same corrupted result.
Thank you for reply.
The HG002 Nanopore data what I use is NCBI SRA run ID SRR12898351 which mentioned in Telogator research paper, I used prefetch
and fasterq-dump
to convert it into the fastq file, and trimmed the adapter by porechop
.
The Telogator research paper mentioned that from the Nanopore data, Telogator reports TL for 39/46 chromosome arms, with 6p, 8p, 9p, 15p, 17p, 19q and 22p absent.
But after the analysis following by the command, I can see there are some reads located in 9p, 15p,17p and 22p in tlens_violin_03_all-pass.png. Probably there are some difference during the analysis step?
Thanks again!
Ahh ok. For the paper I had processed hg002 nanopore data under run ID SRR13062494. I have begun downloading the data from run ID SRR12898351 and will follow up once the workflow completes.
Another possible explanation for why you are observing chromosomes that were not reported in the paper is that since then we have made some improvements to Telogator, including the addition of the nanopore_kmers.tsv file, as suggested by #1 .
Got it, thank you a lot!
Maybe the difference of result is because of the different sample used in process?
By the way, does Telogator spend lots of time to do analysis? Last time I cost a whole weekend to complete telogator.py. There seems to be no progress bar during the analysis, can you consider adding a progress bar?
There are a few ways to speed up the workflow, specifically grab_reads_from_subtel_aln.py
runs a bit faster if your fasta does not have newlines in the read sequence (e.g. using ./fastq-dump --fasta 0
to get the reads from SRA, and using the --newlines no
input option). telogator.py
can be accelerated by running multiple invocations in parallel and merging the results, via the --job
input option. Also, you can tell telogator to spit out progress as it processes each read using the --debug
input option.
I acknowledge that this is a bit sloppy at the moment. For the next update I plan to switch over to using multiprocessing, so that telogator can be parallelized without having to run multiple commands. I'll also add some progress indications.
(I downloaded the nanopore data and have started the workflow, I'll follow up on the results soon)
Greetings, I was able to successfully process this data using the following commands (filepaths simplified for readability):
(0) download reads (with no wrapping)
./fastq-dump --fasta 0 --stdout SRR12898351 | gzip > SRR12898351.fa.gz
(1) initial alignment
minimap2 -ax map-ont -t 6 -N 5 -Y -L t2t-and-subtel.fa SRR12898351.fa.gz | samtools view -bhS - > SRR12898351.bam
samtools sort -@ 3 -T SRR12898351.temp -o SRR12898351-sort.bam SRR12898351.bam
samtools index SRR12898351-sort.bam
(2) extract reads aligned to subtelomere regions and align them to telogator ref
python3 grab_reads_from_subtel_aln.py --fa SRR12898351.fa.gz --bam SRR12898351-sort.bam --bed subtel_regions.bed --out SRR12898351_subtel-reads.fa.gz --newlines no --readtype SRA
minimap2 -ax map-ont -t 6 -N 5 -Y -L t2t-telogator-ref.fa SRR12898351_subtel-reads.fa.gz | samtools view -bhS - > subtel-aln.bam
samtools sort -@ 3 -T subtel-aln.temp -o subtel-aln-sort.bam subtel-aln.bam
samtools index subtel-aln-sort.bam
Here are the files produced thus far:
-rw-r--r-- 1 zach staff 24G Mar 5 23:44 SRR12898351-sort.bam
-rw-r--r-- 1 zach staff 29M Mar 6 11:21 SRR12898351-sort.bam.bai
-rw-r--r-- 1 zach staff 14G Mar 5 11:02 SRR12898351.fa.gz
-rw-r--r-- 1 zach staff 129M Mar 6 15:56 SRR12898351_subtel-reads.fa.gz
-rw-r--r-- 1 zach staff 568M Mar 6 16:46 subtel-aln-sort.bam
-rw-r--r-- 1 zach staff 605K Mar 6 16:46 subtel-aln-sort.bam.bai
(3) run telogator
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 1 8
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 2 8
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 3 8
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 4 8
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 5 8
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 6 8
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 7 8
samtools view subtel-aln-sort.bam | python3 telogator.py -k nanopore_kmers.tsv -i - -o telogator/ --job 8 8
(running in parallel)
python3 merge_jobs.py -i /Volumes/Epoch99/BIG_RELOCATED_DATA_2/hg002_nanopore_promethion/telogator/
The run seems to have completed successfully:
Here is results.tsv:
#subtel position tel_len_p90 tel_lens read_lens
chr1_p 3083 4551 2981,3316,3437,3991,4231,4249,4427,4842 16380,8261,20587,16677,23859,17295,21735,40114
chr1_q 496458 6932 381,3018,5235,5948,6209,7656 8971,27067,20007,11078,21399,16958
chr3_p 2817 10089 4657,5767,6008,6043,6355,6528,7105,7193,10077,10206 7400,14591,13462,7449,11805,23025,8681,13445,51249,37469
chr3_q 298047 1582 1144,1631 15277,6648
...
Which look fine.
The main thing that jumps out to me is the --readtype SRA
parameter for grab_reads_from_subtel_aln.py
. I think for this data it is crucial to indicate that the reads following the SRA naming convention, otherwise I'm not sure what your subtel-reads fasta would contain. I think I could improve the documentation to point this out.
Thank you for testing, the figure result looks like which we analysis, so we can confirm that the difference between the result on the paper and the result what we analysis is due to difference of SRA sample.
Our read name which generated by fasterq-dump
looks like SRR12898351.sra.8432143
, so the weird value in tsv file maybe caused by the different type of read name? Sounds cool.
And how should I run telogator parallelly? By using nohup
? or just open multiple terminal and write the command?
Thanks again!
To run them parallel in a single terminal you could use something like:
python3 telogator.py --job 1 2 &
python3 telogator.py --job 2 2 &
wait
To make it a bit more user-friendly I hope to incorporate multiprocessing into telogator soon, so that it could be run with a single command and the user could specify how much to parallelize it.
This sounds great and I'll keep watching. Thanks again for your detailed reply.
Hello! I got a little confused about the output file.
I used the HG002 Nanopore data and run telogator with the command in the README.md, and change the pbmm2 into minimap2.
minimap2 -ax map-ont -t 16 -N 5 -Y -L t2t-and-subtel.fa HG002_nanopore_trimmed.fastq > HG002_telogator_new.sam
samtools view -bSh HG002_telogator_new.sam > HG002_telogator_new.bam
samtools sort -O bam -o HG002_telogator_new_sorted.bam HG002_telogator_new.bam
samtools index HG002_telogator_new_sorted.bam
python ../../../../software/telogator/grab_reads_from_subtel_aln.py --bam HG002_telogator_new_sorted.bam --fa HG002_nanopore_trimmed.fasta --out new-subtel-reads.fa.gz --bed ../../../../software/telogator/resources/subtel_regions.bed
minimap2 -ax map-ont -t 24 -N 5 -Y -L ../../../../software/telogator/resources/t2t-telogator-ref.fa new-subtel-reads.fa.gz > new_subtel_aln.sam
samtools view -bSh new_subtel_aln.sam > new_subtel_aln.bam
samtools sort -O bam -o new_subtel_aln_sorted.bam new_subtel_aln.bam
samtools index new_subtel_aln_sorted.bam
samtools view new_subtel_aln_sorted.bam | python ../../../../software/telogator/telogator.py -i - -o new_telogator_out/ -k ../../../../software/telogator/resources/nanopore_kmers.tsv
python ../../../../software/telogator/merge_jobs.py -i new_telogator_out/
After analysis, I got the output file like this.
And when I open the results.tsv, I found that the value in tel_lens and read_lens are weird, seems like very large number.
Is there any error during my step? And how to explain that weird large number in tel_lens and read_lens?
Thank you very much!