Step 4 taking too long - Githubissues

re2srm commented 7 months ago

Hello,

I am running some human hic data through the pipeline and step 4 is taking extremely long. At the end of step3 I have around 150M uniquely mapping pairs and 35M multimapping pairs. Up to now my step4 has been running for around 24 hours and the only outputs are validPairs.MULTI and validPairs.UNI files for chromosome 1 and 2. At this rate this step would take >10 days to run. Is this normal and are there any parameters I can tune to make this faster? I am currently using all default parameters from the demo file (binning resolution of 10k with KR normalization).

I can use up to 24 cpus and 100G of memory but this step doesn't seem to have any options to use multiple cpus. Would appreciate any feedback.

Thanks

yezhengSTAT commented 7 months ago

Hello, It seems that you got stuck with the duplicates removal step which consists of a lot of sorting and merging. You may consider separating the input file by the chromosomes into individual files and then running step 4 on each chromosome file in parallel. Otherwise, running at 10kb can be quite heavy for the human genome depending on the sequencing depth. You may consider starting with the demo data or running at a much lower resolution (such as 1Mb or even 10Mb) just to make sure all the steps run successfully.

Hope it helps.

Thanks, Ye

re2srm commented 7 months ago

Thanks, I already ran the demo data and everything worked fine. For splitting by chromosome, can I just specify a single chromosome in the chromosome list parameter and then run the step separately for each chromosome (instead of splitting the input file)?

I ran the previous steps at a resolution of 10k. Would it be ok if I now run step 4 at a higher resolution or do I have to start from the beginning? Also are you recommending 1mb or 10mb just for testing or for actual analysis?

Thanks

yezhengSTAT commented 7 months ago

Thanks, I already ran the demo data and everything worked fine. For splitting by chromosome, can I just specify a single chromosome in the chromosome list parameter and then run the step separately for each chromosome (instead of splitting the input file)?

Yes, that should also work!

I ran the previous steps at a resolution of 10k. Would it be ok if I now run step 4 at a higher resolution or do I have to start from the beginning? Also are you recommending 1mb or 10mb just for testing or for actual analysis?

Yeah.....you have to go back to step3 as the bin index is given in that step. Actually, most duplicate removal procedures have nothing to do with binning. Basically, it sorts the alignment position (chrom-start-end) and removes those reads with the same genomic coordinate. In other words, you probably won't observe too much acceleration until the normalization step. You may monitor your memory to see if that is the bottleneck limiting the sorting procedure. Or you may further consider splitting one chromosome into small chunk of files that are ordered by starting position or similar strategies......

Choosing a resolution depends on the nature of your study. If you want to focus on a more general structure like compartment or TAD, 1Mb or 100kb may be sufficient. If your study requires a much-refined structure or your target is at a small range of genome, then a much higher resolution is needed.

Best, Ye

re2srm commented 7 months ago

Thanks, I am now doing it in parallel after splitting by chromosome. It seems the bottleneck is in the duplicate removal step as increasing the resolution doesn't speed up the process. Just wanted to ask how I should merge the output files from the chromosomes for the next steps. Should I just use the 'cat' command to concatenate them like this:

cat .MULTI.binPair.multi > allchroms.MULTI.binPair.multi cat .MULTI.binPair.multiKeys > allchroms.MULTI.binPair.multiKeys cat .binPairCount.uni > allchroms.binPairCount.uni cat .binPair.multi > allchroms.binPair.multi

yezhengSTAT commented 7 months ago

Yes, you are right!

re2srm commented 7 months ago

Hi,

I am getting this error when running step6 on some of my chromosomes:

Traceback (most recent call last): File "/scratch/mHiC/s6_em.py", line 88, in main(priorPath, multiFilePath, multiKeysPath, uniFilePath, outFilePath, threshold) File "s6_em_cython.pyx", line 306, in s6_em_cython.main (spline, interProb) = read_spline_prior(priorPath) File "s6_em_cython.pyx", line 79, in s6_em_cython.read_spline_prior interProb = min(spline.values())/2 ValueError: min() arg is an empty sequence

This is only happening on a few chromosomes and most run fine. I ran step5 with normalization set to 'None' (because I was having some other errors using KR normalization). Any idea why this might happen?

Thanks

yezhengSTAT commented 7 months ago

It seems that the spline is not fitted correctly. Maybe......it is because the data is too sparse for some chromosomes. Maybe try with larger bins?

Thanks, Ye

yezhengSTAT / mHiC

Step 4 taking too long #23