yechengxi / DBG2OLC

A genome assembler that reduces the computational time of human genome assembly from 400,000 CPU hours to 2,000 CPU hours, utilizing long erroneous 3GS sequencing reads and short accurate NGS sequencing reads.
GNU General Public License v3.0

multithreading #23

Closed janvanoeveren closed 7 years ago

janvanoeveren commented 7 years ago

Hi, I was just wondering whether DBG2OLC has a multithreading option, because it seems to be quite slow and uses only 5% CPU on a multicore server. What am I missing?

yechengxi commented 7 years ago

DBG2OLC was implemented as a single-threaded program. However, if you have a lot of computational resources, the way to make it faster is to split your 3GS reads into batches and run DBG2OLC on each of them.

For each batch, DBG2OLC will compute the compressed reads and then try to assemble. You may stop the program at any time once the compression is complete.

In a final call of the program, include all the read files and use LD 1 to load all the precomputed compressed reads and assemble them.
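The batching step described above could be sketched as follows. This is a minimal illustration, not part of DBG2OLC: the function name `split_fastq` and the `batch_N.fastq` file names are hypothetical, and the code assumes a FASTQ file with exactly four lines per record (no wrapped sequences).

```python
from itertools import islice
from pathlib import Path


def split_fastq(src, out_dir, n_batches):
    """Distribute 4-line FASTQ records round-robin over n_batches files.

    Assumption: every record occupies exactly four lines.
    Returns the list of batch file paths (hypothetical naming scheme).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"batch_{i}.fastq" for i in range(n_batches)]
    handles = [p.open("w") for p in paths]
    try:
        with open(src) as reads:
            index = 0
            while True:
                # Read one record (header, sequence, separator, qualities).
                record = list(islice(reads, 4))
                if not record:
                    break
                handles[index % n_batches].writelines(record)
                index += 1
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Each resulting batch file can then be passed to its own DBG2OLC run, for example in a separate folder per batch.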

janvanoeveren commented 7 years ago

Dear Chengxi Ye, Thanks for your fast reply. Does this approach not cause problems with creating the overlap in compressed reads? And what about the k-mer analysis of the contigs? Is this redone for each DBG2OLC call or does it use previously created ContigKmerIndex files? Is there a parameter for this?

Thanks, Jan


yechengxi commented 7 years ago

As long as you use the same parameters and the same contigs, the k-mer analysis will produce the same result. The most time-consuming step is calculating the compressed reads. When you use 'LD 1' and include the full set of reads, DBG2OLC will load all the precomputed compressed reads and recompute the overlaps. The overlaps previously computed for each subset are discarded.

When we assembled the human genome, we split the PacBio reads into a few batches and put them in separate folders. The same set of parameters and the same Illumina contigs were used to generate the k-mer index and compressed reads for each batch. Then we moved all the compressed reads into one folder and called DBG2OLC again with LD 1, feeding all the PacBio reads in the command.
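The two kinds of invocation in this workflow can be sketched as argument lists. This is only an illustration under stated assumptions: the helper names, `contigs.fa` and the batch file names are hypothetical; the parameter values are the ones from the command quoted later in this thread; and the keywords `Contigs`, `f` and `LD` are used as they appear in this discussion. Collecting the compressed-read files into one folder before the final call is left out, because their file names are not given here.

```python
def dbg2olc_batch_commands(binary, contigs, batches, params):
    """One command per batch; each run computes that batch's compressed reads.

    The same params and contigs are used for every batch, as the thread requires.
    """
    base = [binary, *params, "Contigs", contigs]
    return [base + ["f", reads] for reads in batches]


def dbg2olc_final_command(binary, contigs, batches, params):
    """Final run: LD 1 plus every read file, so overlaps are recomputed once."""
    cmd = [binary, *params, "LD", "1", "Contigs", contigs]
    for reads in batches:
        cmd += ["f", reads]
    return cmd


# Hypothetical inputs, for illustration only.
params = ["AdaptiveTh", "0.001", "KmerCovTh", "2", "MinOverlap", "10"]
batches = ["batch_0.fastq", "batch_1.fastq"]

for cmd in dbg2olc_batch_commands("DBG2OLC", "contigs.fa", batches, params):
    print(" ".join(cmd))
print(" ".join(dbg2olc_final_command("DBG2OLC", "contigs.fa", batches, params)))
```

The batch commands are independent of each other, so they can be run in parallel on separate cores or nodes; only the final call has to wait for all of them.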

yechengxi commented 7 years ago

I have summarized the procedure on the project page. This is a very good question.

janvanoeveren commented 7 years ago

Actually, the k-mer analysis of the contigs takes a long time for my data set (~2 days), so it is really a pity to repeat it for every PacBio subset. Maybe you could change this so that the ContigKmerIndex_HT files can optionally be taken as input?

yechengxi commented 7 years ago

That's also possible. There is another, undocumented option: you can use 'LD0 1' to load the precomputed contig k-mer index.

janvanoeveren commented 7 years ago

Thanks - I guess I then have to copy the "ContigKmerIndex_HT_content" and "ContigKmerIndex_HT_idx.txt" files to the working directory? And not specify the Contigs parameter?

janvanoeveren commented 7 years ago

[2]- Segmentation fault (core dumped) /opt/kgapps/DBG2OLC-20170411/DBG2OLC AdaptiveTh 0.001 KmerCovTh 2 MinOverlap 10 LD0 1 f ../input_PacBio/sra_data.fastq > DBG2OLC_test_LD0.log

... any clue?

yechengxi commented 7 years ago

You will need to feed the contigs file as usual.
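As a sketch of what that correction means for the command above, the snippet below inserts a `Contigs` argument into a command line that lacks one. The helper name `with_contigs` and the file name `contigs.fa` are hypothetical, and whether this alone resolves the segmentation fault is not confirmed in this thread.

```python
import shlex


def with_contigs(command, contigs):
    """Return the command with 'Contigs <file>' added if it is missing.

    The pair is placed before the first 'f' argument, or at the end
    if no read file is given.
    """
    tokens = shlex.split(command)
    if "Contigs" not in tokens:
        position = tokens.index("f") if "f" in tokens else len(tokens)
        tokens[position:position] = ["Contigs", contigs]
    return shlex.join(tokens)


# The failing command from the thread, without the output redirection.
failing = (
    "DBG2OLC AdaptiveTh 0.001 KmerCovTh 2 MinOverlap 10 LD0 1 "
    "f ../input_PacBio/sra_data.fastq"
)
print(with_contigs(failing, "contigs.fa"))  # contigs.fa is a placeholder name
```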