yfukasawa / LongQC

LongQC is a tool for the data quality control of the PacBio and ONT long reads.
MIT License

LongQC is time-consuming #65

Closed by mhjiang97 9 months ago

mhjiang97 commented 10 months ago

Hi there.

What I ran:

python longqc.py sampleqc -x ont-ligation -o mysample -s mysample -n 5000 -p 20 -m 2 -i 3 mysample.fq.gz 

It is still running after 4 days.

The log file currently looks like this:

longQC:2024-01-08 22:11:32,679:524:INFO:Calculating overlaps of sampled reads...
...
longQC:2024-01-12 17:54:15,362:524:INFO:Calculating overlaps of sampled reads...

coverage_err.txt keeps repeating lines like these:

[M::worker_pipeline::332904.300*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332904.328*4.61] collected minimizers
[M::mm_idx_gen::332904.345*4.61] sorted minimizers
[M::mm_mapopt_update::332904.345*4.61] mid_occ = 2
[M::main::332904.345*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332904.961*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332904.987*4.61] collected minimizers
[M::mm_idx_gen::332905.034*4.61] sorted minimizers
[M::mm_mapopt_update::332905.034*4.61] mid_occ = 2
[M::main::332905.034*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332905.731*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332905.743*4.61] collected minimizers
[M::mm_idx_gen::332905.763*4.61] sorted minimizers
[M::mm_mapopt_update::332905.763*4.61] mid_occ = 2
[M::main::332905.764*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332906.406*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332906.435*4.61] collected minimizers
[M::mm_idx_gen::332906.443*4.61] sorted minimizers
[M::mm_mapopt_update::332906.444*4.61] mid_occ = 2
[M::main::332906.444*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332907.028*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332907.062*4.61] collected minimizers
[M::mm_idx_gen::332907.086*4.61] sorted minimizers
[M::mm_mapopt_update::332907.086*4.61] mid_occ = 2
[M::main::332907.086*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332908.003*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332908.014*4.61] collected minimizers
[M::mm_idx_gen::332908.050*4.61] sorted minimizers
[M::mm_mapopt_update::332908.050*4.61] mid_occ = 2
[M::main::332908.050*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332908.685*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332908.715*4.61] collected minimizers
[M::mm_idx_gen::332908.722*4.61] sorted minimizers
[M::mm_mapopt_update::332908.722*4.61] mid_occ = 2
[M::main::332908.722*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332909.325*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332909.341*4.61] collected minimizers
[M::mm_idx_gen::332909.370*4.61] sorted minimizers
[M::mm_mapopt_update::332909.370*4.61] mid_occ = 2
[M::main::332909.370*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332910.056*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332910.103*4.61] collected minimizers
[M::mm_idx_gen::332910.125*4.61] sorted minimizers
[M::mm_mapopt_update::332910.125*4.61] mid_occ = 2
[M::main::332910.125*4.61] loaded/built the index for 1 target sequence(s)
[M::worker_pipeline::332910.901*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)
[M::mm_idx_gen::332910.919*4.61] collected minimizers
[M::mm_idx_gen::332910.922*4.61] sorted minimizers
[M::mm_mapopt_update::332910.922*4.61] mid_occ = 2
[M::main::332910.922*4.61] loaded/built the index for 1 target sequence(s)

How can I make this program run faster? Thanks in advance.

yfukasawa commented 10 months ago

Hi @mhjiang97,

A 4-day run is unusual, but I would say it's possible. I guess your data set could be quite large. Alternatively, for some repetitive genomes or genomes with skewed GC content, such as P. falciparum, it can take far longer.
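
To get a rough feel for how long the overlap step has been running and how quickly each round finishes, the coverage_err.txt lines quoted above can be summarized with a few lines of Python. This is only a sketch: it assumes the bracketed prefix follows the usual minimap2 log convention (elapsed wall-clock seconds, then "*", then the CPU-to-wall-clock ratio), and the names LOG_RE and summarize are made up for this illustration; they are not part of LongQC.

```python
import re

# Assumed format (minimap2-style): "[M::<stage>::<elapsed seconds>*<CPU/wall ratio>] ..."
LOG_RE = re.compile(r"^\[M::(?P<stage>\w+)::(?P<real>\d+\.\d+)\*(?P<ratio>\d+\.\d+)\]")


def summarize(lines):
    """Return (elapsed_days, rounds, mean_seconds_per_round), or None if nothing matched.

    A "round" is counted each time a worker_pipeline line ("mapped N sequences") appears.
    """
    mapped_times = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m and m.group("stage") == "worker_pipeline":
            mapped_times.append(float(m.group("real")))
    if not mapped_times:
        return None
    elapsed_days = mapped_times[-1] / 86400  # seconds per day
    rounds = len(mapped_times)
    # Average gap between consecutive rounds; undefined for a single round.
    mean_gap = (mapped_times[-1] - mapped_times[0]) / (rounds - 1) if rounds > 1 else None
    return elapsed_days, rounds, mean_gap


# A few lines copied from the report above.
sample = [
    "[M::worker_pipeline::332904.300*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)",
    "[M::mm_idx_gen::332904.328*4.61] collected minimizers",
    "[M::worker_pipeline::332904.961*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)",
    "[M::main::332905.034*4.61] loaded/built the index for 1 target sequence(s)",
    "[M::worker_pipeline::332905.731*4.61] mapped 5000 sequences. (Peak RSS: 7.200 GB)",
]
print(summarize(sample))

# For a real run, pass the open file instead of the sample list:
#   with open("coverage_err.txt") as fh:
#       print(summarize(fh))
```

On the quoted excerpt, each round takes well under a second while the elapsed counter is already close to four days. That points to a very large number of rounds rather than one stuck step, though this is an interpretation of the log and not something confirmed in this thread.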

One remedy for such cases is the -f (or --fast) option, whose parameters are tuned for speed with some tradeoff. The run can still be time-consuming even with this option, but it will certainly be faster.

I hope this helps.

Yoshinori

yfukasawa commented 9 months ago

After monitoring, it seems that the issue has been resolved. As a result, I will proceed to close it.