zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

Chromosome number parameters #47

Closed · awesomedeer closed this issue 3 months ago

awesomedeer commented 3 months ago

Dear Zeng, @zengxiaofei I got the same error, although I used the *.p_ctg files and the correct _1 and _2 paired-end reads for the mapping.

Traceback (most recent call last):
  File "/public/home/wusong/sfw/HapHiC/scripts/HapHiC_pipeline.py", line 529, in <module>
    main()
  File "/public/home/wusong/sfw/HapHiC/scripts/HapHiC_pipeline.py", line 510, in main
    haphic_cluster(args)
  File "/public/home/wusong/sfw/HapHiC/scripts/HapHiC_pipeline.py", line 393, in haphic_cluster
    raise RuntimeError(
RuntimeError: Pipeline Abortion: Inflation recommendation failed. It seems that some chromosomes were grouped together, or the maximum number of clusters is even less than the expected number of chromosomes. For more details, please check out the logs.
Traceback (most recent call last):
  File "/public/home/wusong/sfw/HapHiC/haphic", line 117, in <module>
    subprocess.run(commands, check=True)
  File "/public/home/wusong/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/public/home/wusong/sfw/HapHiC/scripts/HapHiC_pipeline.py', 'allhaps.fa', 'HiC.filtered.bam', '32', '--threads', '20', '--processes', '20', '--gfa', 'darwin.asm.hic.hap1.p_ctg.gfa,darwin.asm.hic.hap2.p_ctg.gfa,darwin.asm.hic.hap3.p_ctg.gfa,darwin.asm.hic.hap4.p_ctg.gfa']' returned non-zero exit status 1.

I noticed that the BAM file size dropped dramatically from 1.1 GB (HiC.bam) to 4.2 MB (HiC.filtered.bam). Is that a potential problem?

Best regards Song


My log file:

2024-08-15 18:08:52 [run] Program started, HapHiC version: 1.0.5 (update: 2024.08.08)
2024-08-15 18:08:52 [run] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-08-15 18:08:52 [run] Command: /public/home/wusong/sfw/HapHiC/scripts/HapHiC_pipeline.py allhaps.fa HiC.filtered.bam 32 --threads 20 --processes 20 --gfa darwin.asm.hic.hap1.p_ctg.gfa,darwin.asm.hic.hap2.p_ctg.gfa,darwin.asm.hic.hap3.p_ctg.gfa,darwin.asm.hic.hap4.p_ctg.gfa
2024-08-15 18:08:52 [detect_format] The file for Hi-C read alignments is detected as being in BAM format
2024-08-15 18:08:52 [parse_fasta] Parsing input FASTA file...
2024-08-15 18:11:01 [parse_gfa] Parsing input gfa file(s)...
2024-08-15 18:11:14 [stat_fragments] Making some statistics of fragments (contigs / bins)
2024-08-15 18:11:14 [stat_fragments] bin_size is calculated to be 2000000 bp
2024-08-15 18:11:36 [parse_alignments] Parsing input alignments...
2024-08-15 18:11:36 [output_pickle] Writing HT_link_dict to HT_links.pkl...
2024-08-15 18:11:36 [output_clm] Writing clm_dict to paired_links.clm...
2024-08-15 18:11:36 [filter_fragments] Filtering fragments...
2024-08-15 18:11:36 [filter_fragments] [Nx filtering] 1078 fragments kept
2024-08-15 18:11:36 [filter_fragments] [RE sites filtering] 0 fragments removed, 1078 fragments kept
2024-08-15 18:11:36 [filter_fragments] [link density filtering] Parameter --density_lower 0.2X is set to "multiple" mode and equivalent to 0.9916512059369202 in "fraction" mode
2024-08-15 18:11:36 [filter_fragments] [link density filtering] Parameter --density_upper 1.9X is set to "multiple" mode and equivalent to 0.9953617810760668 in "fraction" mode
2024-08-15 18:11:36 [filter_fragments] [link density filtering] 1074 fragments removed, 4 fragments kept
2024-08-15 18:11:36 [filter_fragments] [read depth filtering] Q1=16.0, median=16.0, Q3=16.0, IQR=Q3-Q1=0.0
2024-08-15 18:11:36 [filter_fragments] [read depth filtering] Parameter --read_depth_upper 1.5X is set to "multiple" mode and equivalent to 0.891465677179963 in "fraction" mode
2024-08-15 18:11:36 [filter_fragments] [read depth filtering] 0 fragments removed, 4 fragments kept
2024-08-15 18:11:36 [filter_fragments] [rank sum filtering] Q1=6.0, median=6.0, Q3=6.0, IQR=Q3-Q1=0.0
2024-08-15 18:11:36 [filter_fragments] [rank sum filtering] Parameter --rank_sum_upper 1.5X is set to "multiple" mode and equivalent to 1.0 in "fraction" mode
2024-08-15 18:11:36 [filter_fragments] [rank sum filtering] 0 fragments removed, 4 fragments kept
2024-08-15 18:11:36 [reduce_inter_hap_HiC_links] Reducing inter-haplotype Hi-C links in flank_link_dict...
2024-08-15 18:11:36 [reduce_inter_hap_HiC_links] Reducing inter-haplotype Hi-C links in full_link_dict...
2024-08-15 18:11:36 [output_pickle] Writing full_link_dict to full_links.pkl...
2024-08-15 18:11:36 [run] Hi-C linking matrix was constructed in 164.13156461715698s
2024-08-15 18:11:36 [run_mcl_clustering] Performing Markov clustering...
2024-08-15 18:11:36 [mcl] The matrix has converged after 11 rounds of iterations (expansion: 2, inflation: 1.1, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:36 [mcl] The matrix has converged after 16 rounds of iterations (expansion: 2, inflation: 1.2, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 10 rounds of iterations (expansion: 2, inflation: 1.3, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 11 rounds of iterations (expansion: 2, inflation: 1.4, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 14 rounds of iterations (expansion: 2, inflation: 1.5, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 9 rounds of iterations (expansion: 2, inflation: 1.6, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 7 rounds of iterations (expansion: 2, inflation: 1.7, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 6 rounds of iterations (expansion: 2, inflation: 1.8, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 5 rounds of iterations (expansion: 2, inflation: 1.9, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 5 rounds of iterations (expansion: 2, inflation: 2.0, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 5 rounds of iterations (expansion: 2, inflation: 2.1, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 4 rounds of iterations (expansion: 2, inflation: 2.2, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 4 rounds of iterations (expansion: 2, inflation: 2.3, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 4 rounds of iterations (expansion: 2, inflation: 2.4, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 4 rounds of iterations (expansion: 2, inflation: 2.5, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 4 rounds of iterations (expansion: 2, inflation: 2.6, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 4 rounds of iterations (expansion: 2, inflation: 2.7, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 4 rounds of iterations (expansion: 2, inflation: 2.8, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.9, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 3.0, maximum iterations: 200, pruning threshold: 0.0001)
2024-08-15 18:11:37 [run_mcl_clustering] The maximum number of clusters (2) is even less than the expected number of chromosomes (32). You could try higher inflation.
2024-08-15 18:11:37 [run] 20 round(s) of Markov clustering finished in 1.2137212753295898s, average 0.06068606376647949s per round
2024-08-15 18:11:37 [output_statistics] Making some statistics for the next HapHiC reassignment step...
2024-08-15 18:12:42 [run] Program finished in 229.71603918075562s

Originally posted by @awesomedeer in https://github.com/zengxiaofei/HapHiC/issues/25#issuecomment-2291039346

zengxiaofei commented 3 months ago

What is the size of the input assembly file allhaps.fa? Based on the calculated bin_size and the nchrs parameter you specified, the assembly size should be greater than 2 Mb × 30 × 32 = 1.92 Gb. However, the size of the unfiltered BAM file is only 1.1 GB, indicating that you aligned less than 1X of Hi-C reads. After filtering, the BAM file is only 4.2 MB. These Hi-C reads are clearly insufficient for scaffolding.
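The arithmetic behind this size check can be sketched as follows (the 30-bins-per-chromosome factor is an assumption inferred from this reply, not taken from HapHiC's code; bin_size and nchrs come from the log and command line above):

```python
# Sketch of the minimum-assembly-size estimate discussed above.
bin_size = 2_000_000   # bp, from the log: "bin_size is calculated to be 2000000 bp"
bins_per_chrom = 30    # assumed heuristic (~30 bins per chromosome)
nchrs = 32             # expected chromosome number from the HapHiC command

min_assembly_size = bin_size * bins_per_chrom * nchrs
print(f"{min_assembly_size / 1e9:.2f} Gb")  # 1.92 Gb
```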

awesomedeer commented 3 months ago

allhaps.fa is around 2.69 Gb and the raw Hi-C reads total 130 GB. My bwa command is:

$ bwa mem -5SP -t 20 allhaps.fa ~/raw/darwin_1.fastq.gz ~/raw/darwin_2.fastq.gz | samblaster | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam

I installed samblaster via conda; its version is 0.1.26. Might that be a problem?

zengxiaofei commented 3 months ago

You could try aligning Hi-C reads without any filtering, and then get stats for the resulting BAM file using samtools flagstat:

$ bwa mem -5SP -t 20 allhaps.fa ~/raw/darwin_1.fastq.gz ~/raw/darwin_2.fastq.gz | samblaster | samtools view - -@ 14 -S -h -b -o HiC.bam
$ samtools flagstat HiC.bam > flagstat.txt
awesomedeer commented 3 months ago

Hi Zeng, this is my flagstat.txt file:

11669087 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
2397559 + 0 supplementary
238640 + 0 duplicates
11630744 + 0 mapped (99.67% : N/A)
9271528 + 0 paired in sequencing
4635764 + 0 read1
4635764 + 0 read2
0 + 0 properly paired (0.00% : N/A)
9211858 + 0 with itself and mate mapped
21327 + 0 singletons (0.23% : N/A)
7568972 + 0 with mate mapped to a different chr
101865 + 0 with mate mapped to a different chr (mapQ>=5)

Is there a problem with these Hi-C files? I used them for hifiasm assembly in Hi-C mode and that seemed fine.

zengxiaofei commented 3 months ago

I don't know whether these stats include all of your Hi-C reads. First, there are only 11,669,087 reads, which corresponds to only ~1.75 Gb of data at a 150 bp read length. Second, the total read count is an odd number, which should not be possible for paired-end sequencing. You need to check your input Hi-C reads carefully.
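The 1.75 Gb figure follows directly from the read count (a quick sketch; the 150 bp read length is the assumption stated in the reply, and the assembly size comes from earlier in the thread):

```python
# Estimate total sequencing yield and rough coverage from the flagstat read count.
total_reads = 11_669_087   # "in total" line of flagstat.txt
read_length = 150          # bp, assumed paired-end 150 bp sequencing
assembly_size = 2.69e9     # size of allhaps.fa reported earlier

yield_bp = total_reads * read_length
print(f"{yield_bp / 1e9:.2f} Gb")                    # ~1.75 Gb
print(f"~{yield_bp / assembly_size:.2f}X coverage")  # well below 1X
```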

awesomedeer commented 3 months ago

Thanks for the reply! I will check my Hi-C input and update here.

awesomedeer commented 3 months ago

@zengxiaofei I realized the problem was with the raw Hi-C file: I forgot to check its MD5 checksum. I re-downloaded the file and the BAM size is now reasonable. Thanks a lot, Zeng. I will close this issue.

Best regards Song

awesomedeer commented 3 months ago

Sorry for the follow-up question. HiC.bam (96 GB) looks fine, but HiC.filtered.bam (539 MB) is small, and the chromosome error appears again. Does this mean that the Hi-C data remaining after filtering are not enough for scaffolding? I have tried different --max_inflation values but all failed.

zengxiaofei commented 3 months ago

Yes, these filtered Hi-C reads might not be enough for HapHiC scaffolding. According to our tests on the autotetraploid Medicago sativa XinJiangDaYe, whose heterozygous sites are evenly distributed across the genome, the minimum requirement for effective Hi-C depth is around 0.5X. In your case, the effective Hi-C depth is only 0.2X, and the situation may get worse if the heterozygous sites are not evenly distributed. Alternatively, you could scaffold each haplotype separately, with nchrs set to 8. But please note that the Hi-C phasing in hifiasm may lead to imbalanced haplotypes; see their latest article published in Nature Methods this year.
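Effective Hi-C depth in this sense can be estimated as the bases surviving filtering divided by the assembly size (a sketch; the read-pair count below is illustrative, chosen to reproduce roughly 0.2X, and is not a number taken from this thread):

```python
def effective_hic_depth(read_pairs: int, read_length: int, assembly_size: int) -> float:
    """Approximate effective Hi-C depth: filtered bases / assembly size."""
    return read_pairs * 2 * read_length / assembly_size

# Illustrative: ~1.8 million surviving 150 bp read pairs on a 2.69 Gb assembly
depth = effective_hic_depth(1_800_000, 150, 2_690_000_000)
print(f"{depth:.2f}X")  # 0.20X
```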

awesomedeer commented 3 months ago

I see your point. I tried p_utg rather than hap*.p_ctg, and HiC.filtered.bam (26 GB) looks good after filtering. I ran the pipeline command with nchrs set to 32 and it generated great results! HapHiC's performance is really outstanding, although I can't understand why using hap*.p_ctg yields too few Hi-C alignments and further errors, since they are generated by hifiasm at the same time.

zengxiaofei commented 3 months ago

I'm glad to see that you have got satisfactory results!

> I can't understand why using hap*.p_ctg yields too few Hi-C alignments and further errors, since they are generated by hifiasm at the same time.

The autotetraploid potato C88 assembly phased using Hi-C data in hifiasm's latest article is much larger than the C88.v1 assembly. So, I think hifiasm may have introduced more redundant sequences into hap*.p_ctg in your case, which could result in more Hi-C reads being filtered out by the MAPQ criterion.
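To illustrate the redundancy point: a read that maps equally well to two redundant copies of a sequence typically receives MAPQ 0 from bwa, and any mapping-quality cutoff then discards it. A toy sketch (not HapHiC's actual filtering code; the records and the cutoff value are illustrative):

```python
# Toy alignment records as (read_name, mapq) pairs. bwa assigns MAPQ 0
# to reads that align equally well to multiple (redundant) sequences.
alignments = [("r1", 60), ("r2", 0), ("r3", 30), ("r4", 0)]

MIN_MAPQ = 1  # illustrative cutoff: discard ambiguous (MAPQ 0) alignments
kept = [name for name, mapq in alignments if mapq >= MIN_MAPQ]
print(kept)  # ['r1', 'r3']
```

The more redundant the assembly, the larger the MAPQ-0 fraction, which matches the sharp shrinkage seen between HiC.bam and HiC.filtered.bam.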

awesomedeer commented 3 months ago

Cheers. Thanks for your patience and thoughtful answers!