zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

Each group only contains one contig #29

Closed: YanChunL closed this issue 5 months ago

YanChunL commented 5 months ago

I'm sorry to bother you. I encountered two issues while using HapHiC. (1) In the clustering stage, very few contigs participate in clustering, and in the final results there is only one contig in each group. (2) During the clustering stage, contigs are fragmented into a large number of small segments, totaling over 70,000 fragments. Ultimately, this results in over 70,000 groups, with only one fragment per group.

zengxiaofei commented 5 months ago

I need more information to understand the problem: (1) The commands and parameters you used to run HapHiC. (2) Please upload the complete log files generated in 01.cluster and 02.reassign. (3) Please upload the mcl*.txt file generated in the directory of the recommended inflation under 01.cluster. (4) Please calculate the N10-N90 and L10-L90 of the contigs input into HapHiC using this tool: fa_detail.gz

$ gunzip fa_detail.gz
$ chmod 755 fa_detail
$ fa_detail asm.fa
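
As a fallback in case fa_detail cannot be run, the same kind of Nx/Lx statistic can be computed with plain awk and sort. This is a generic sketch, not part of HapHiC; it works on a toy FASTA written to /tmp (swap in asm.fa), and reports N50/L50 only (change the 0.5 threshold for N10..N90):

```shell
# Toy FASTA standing in for asm.fa; point the pipeline at your real file.
cat > /tmp/toy.fa <<'EOF'
>utg1
AAAAAAAAAA
>utg2
AAAAAA
>utg3
AAAA
EOF

# Contig lengths -> sorted descending -> walk the cumulative sum until
# it reaches 50% of the total assembly length.
stats=$(awk '/^>/{if(len)print len; len=0; next}{len+=length($0)}END{if(len)print len}' /tmp/toy.fa \
  | sort -rn \
  | awk '{a[NR]=$1; total+=$1}
         END{cum=0;
             for(i=1;i<=NR;i++){cum+=a[i];
               if(cum>=total*0.5){printf "N50=%d L50=%d", a[i], i; exit}}}')
echo "$stats"   # N50=10 L50=1 for the toy FASTA
```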
YanChunL commented 5 months ago

(1) haphic pipeline ../00.data/asm.fa HiC.filtered.bam 36 --bin_size 8000 --RE "GATC" --Nx 80 --min_inflation 1.1 --max_inflation 3.0 --inflation_step 0.5 --density_lower 0 --density_upper 1 --threads 10 --processes 10 --correct_nrounds 2

(2) HapHiC_reassign.log, HapHiC_cluster.log

(3) The mcl*.txt file is too big. Here's some sample data:

Group  nContigs  Contigs
group1_8489955bp  1  utg081377l
group2_4381754bp  1  utg000554l
group3_4363227bp  1  utg035823l
group4_4322024bp  1  utg047822l
group5_4068072bp  1  utg047840l
group6_3950194bp  1  utg081742l
group7_3901352bp  1  utg000333l
group8_3874081bp  1  utg081384l
group9_3427345bp  1  utg048184l
group10_3231483bp  1  utg047789l
group11_3110892bp  1  utg081613l
group12_3103527bp  1  utg043876l
group13_2925879bp  1  utg081437l
group14_2922934bp  1  utg038776l
group15_2590973bp  1  utg081357l
group16_2533037bp  1  utg048062l
group17_2519012bp  1  utg047990l
group18_2401887bp  1  utg013509l
group19_2378966bp  1  utg048116l
group20_2328545bp  1  utg081455l
group21_2315231bp  1  utg021157l
group22_2304597bp  1  utg032661l
group23_2287762bp  1  utg009325l
group24_2283224bp  1  utg074002l
group25_2165544bp  1  utg036224l
group26_2142123bp  1  utg000481l
group27_2122842bp  1  utg031713l
group28_2102942bp  1  utg013104l
group29_2061157bp  1  utg081341l

YanChunL commented 5 months ago

This is the file you need for the first case. This is a haploid genome.

(1) /public/agis/panweihua_group/liyanchun/HapHiC/haphic pipeline ../00.data/asm.fa HiC.filtered.bam 12 --pruning 0.00001 --density_lower 0 --density_upper 1 --min_inflation 1.1 --max_inflation 3.0 --inflation_step 0.75 --RE "GATC" --threads 10 --processes 10 --correct_nrounds 2

(2)(3) HapHiC_cluster.log, mcl_inflation_1.1.clusters.txt, mcl_inflation_1.85.clusters.txt, mcl_inflation_2.60.clusters.txt, mcl_inflation_3.35.clusters.txt

(4) [screenshot attachment: 1717557005593]

YanChunL commented 5 months ago

(1) haphic pipeline ../00.data/asm.fa HiC.filtered.bam 36 --bin_size 8000 --RE "GATC" --Nx 80 --min_inflation 1.1 --max_inflation 3.0 --inflation_step 0.5 --density_lower 0 --density_upper 1 --threads 10 --processes 10 --correct_nrounds 2

(2) HapHiC_reassign.log, HapHiC_cluster.log

(3) The mcl*.txt file is too big. Here's some sample data:

#Group  nContigs  Contigs
group1_8489955bp  1  utg081377l
group2_4381754bp  1  utg000554l
group3_4363227bp  1  utg035823l
group4_4322024bp  1  utg047822l
group5_4068072bp  1  utg047840l
group6_3950194bp  1  utg081742l
group7_3901352bp  1  utg000333l
group8_3874081bp  1  utg081384l
group9_3427345bp  1  utg048184l
group10_3231483bp  1  utg047789l
group11_3110892bp  1  utg081613l
group12_3103527bp  1  utg043876l
group13_2925879bp  1  utg081437l
group14_2922934bp  1  utg038776l
group15_2590973bp  1  utg081357l
group16_2533037bp  1  utg048062l
group17_2519012bp  1  utg047990l
group18_2401887bp  1  utg013509l
group19_2378966bp  1  utg048116l
group20_2328545bp  1  utg081455l
group21_2315231bp  1  utg021157l
group22_2304597bp  1  utg032661l
group23_2287762bp  1  utg009325l
group24_2283224bp  1  utg074002l
group25_2165544bp  1  utg036224l
group26_2142123bp  1  utg000481l
group27_2122842bp  1  utg031713l
group28_2102942bp  1  utg013104l
group29_2061157bp  1  utg081341l

[screenshot attachment: 1717557415102]

The process gets killed when I run fa_detail; maybe this file is too big? Also, this is a triploid genome.

zengxiaofei commented 5 months ago

(1) I guess you used the same FASTQ file for both ends of the Hi-C reads; please refer to #27. This is the main problem. (2) Please use the default parameters for your first attempt. Both --bin_size 8000 and --inflation_step 0.5 are too large to get an optimal clustering result. In addition, --density_lower 0 and --density_upper 1 are not recommended. You may have misunderstood the meaning of some parameters. The --bin_size parameter won't truly break your contigs into fragments; it's just a strategy to improve the clustering result by reducing the relative intensity of unfavorable background Hi-C signals.
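
The binning idea can be illustrated with a toy example (plain awk arithmetic; this is not HapHiC's actual implementation): alignment positions along a contig are assigned to fixed-size windows, so Hi-C signal is aggregated per bin while the contig sequence itself stays intact.

```shell
# Assign a few example alignment positions on one contig to 8 kb bins.
# Positions and bin size are illustrative only.
bins=$(awk -v bin=8000 'BEGIN{
  n = split("1200 7999 8000 25000", pos, " ");
  for (i = 1; i <= n; i++)
    printf "pos %s -> bin %d\n", pos[i], int(pos[i] / bin);
}')
echo "$bins"
```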

> The process gets killed when I run fa_detail; maybe this file is too big?

This tool is memory-efficient; your task could have been killed by other rules, e.g., an execution-time limit. But it's okay, I have understood your problem.

YanChunL commented 5 months ago

> (1) I guess you used the same FASTQ file for both ends of the Hi-C reads; please refer to #27. This is the main problem. (2) Please use the default parameters for your first attempt. Both --bin_size 8000 and --inflation_step 0.5 are too large to get an optimal clustering result. In addition, --density_lower 0 and --density_upper 1 are not recommended. You may have misunderstood the meaning of some parameters. The --bin_size parameter won't truly break your contigs into fragments; it's just a strategy to improve the clustering result by reducing the relative intensity of unfavorable background Hi-C signals.
>
> > The process gets killed when I run fa_detail; maybe this file is too big?
>
> This tool is memory-efficient; your task could have been killed by other rules, e.g., an execution-time limit. But it's okay, I have understood your problem.

I used the sample code on GitHub to map and filter the Hi-C reads. --inflation_step 0.5 is the recommended value in the cluster log. And if I do not add the parameters --density_lower 0 and --density_upper 1, it will filter out most of the fragments and leave only three.

zengxiaofei commented 5 months ago

> I used the sample code on GitHub to map and filter the Hi-C reads

Could you please show me the commands for read mapping and filtering? Please also paste some alignments in SAM format:

$ samtools view HiC.filtered.bam | cut -f 1-5 | head -20
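
When inspecting those columns, the second field (FLAG) already reveals whether the aligner saw the reads as a proper pair. Decoding the pairing-related bits defined in the SAM specification can be sketched on toy records (POSIX awk, no samtools needed; the record values below are made up):

```shell
# Toy records mimicking `cut -f 1-5` output: QNAME FLAG RNAME POS MAPQ.
cat > /tmp/toy_alns.txt <<'EOF'
read1 65 utg000554l 100 60
read1 129 utg000554l 900 60
EOF

# FLAG bits per the SAM spec: 0x1 = read paired, 0x40 = first in pair,
# 0x80 = second in pair (decoded with integer arithmetic, no bitwise ops).
flags=$(awk '{f = $2;
  printf "%s: paired=%d first=%d second=%d\n",
         $1, f % 2, int(f / 64) % 2, int(f / 128) % 2}' /tmp/toy_alns.txt)
echo "$flags"
```

In a healthy paired-end BAM, each QNAME appears once as first-in-pair and once as second-in-pair; if every record carries the same first/second bits, something went wrong upstream.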

> --inflation_step 0.5 is the recommended value in the cluster log

HapHiC does NOT make any recommendation on the inflation step length; there is only a recommendation on the inflation itself. A lower inflation step length is always better. By default, this value is set to 0.1 to balance speed and performance.
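
The practical difference between the two step lengths is just search granularity. A quick count over the 1.1-3.0 range (plain shell arithmetic, unrelated to HapHiC's internals) shows how many candidate inflations each step tries:

```shell
# Number of candidate inflation values between 1.1 and 3.0.
# Integer loops avoid floating-point drift you might get from `seq`.
fine=$(awk 'BEGIN{for (i = 11; i <= 30; i++)    printf "%.1f\n", i / 10}' | grep -c '')
coarse=$(awk 'BEGIN{for (i = 11; i <= 30; i += 5) printf "%.1f\n", i / 10}' | grep -c '')
echo "step 0.1: $fine values, step 0.5: $coarse values"   # 20 vs 4
```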

> if I do not add the parameters --density_lower 0 and --density_upper 1, it will filter out most of the fragments and leave only three

We need to solve the problem that may have been introduced in read mapping and filtering first. Discussing these parameters is meaningless until the main problem has been solved.

YanChunL commented 5 months ago

> > I used the sample code on GitHub to map and filter the Hi-C reads
>
> Could you please show me the commands for read mapping and filtering? Please also paste some alignments in SAM format:
>
> $ samtools view HiC.filtered.bam | cut -f 1-5 | head -20
>
> > --inflation_step 0.5 is the recommended value in the cluster log
>
> HapHiC does NOT make any recommendation on the inflation step length; there is only a recommendation on the inflation itself. A lower inflation step length is always better. By default, this value is set to 0.1 to balance speed and performance.
>
> > if I do not add the parameters --density_lower 0 and --density_upper 1, it will filter out most of the fragments and leave only three
>
> We need to solve the problem that may have been introduced in read mapping and filtering first. Discussing these parameters is meaningless until the main problem has been solved.

bwa mem -5SP -t 32 ../00.data/asm.fa ../00.data/mockTriploid_r1.fq.gz ../00.data/mockTriploid_r1.fq.gz | samblaster | samtools view - -@ 32 -S -h -b -F 3340 -o HiC.bam

(2) Filter the alignments with MAPQ 1 (mapping quality ≥ 1) and NM 3 (edit distance < 3)

filter_bam HiC.bam 1 --nm 3 --threads 32 | samtools view - -b -@ 32 -o HiC.filtered.bam

[screenshot attachment: 1717567689121]

zengxiaofei commented 5 months ago

As I guessed, you used the same FASTQ file ../00.data/mockTriploid_r1.fq.gz for both ends of the Hi-C reads, which is incorrect. The correct command should be something like:

bwa mem -5SP -t 32 ../00.data/asm.fa ../00.data/mockTriploid_r1.fq.gz ../00.data/mockTriploid_r2.fq.gz | samblaster | samtools view - -@ 32 -S -h -b -F 3340 -o HiC.bam
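
A quick pre-flight check can catch this mistake before spending hours on mapping. Here is a generic sketch on toy FASTQ files (the /tmp paths and /1, /2 read-name suffixes are placeholders): paired inputs must differ in content yet carry matching read IDs.

```shell
# Toy paired FASTQ files standing in for mockTriploid_r{1,2}.fq.gz.
printf '@read1/1\nACGT\n+\nIIII\n' | gzip > /tmp/r1.fq.gz
printf '@read1/2\nTGCA\n+\nIIII\n' | gzip > /tmp/r2.fq.gz
gzip -dc /tmp/r1.fq.gz > /tmp/r1.fq
gzip -dc /tmp/r2.fq.gz > /tmp/r2.fq

# 1) The two inputs must not hold identical data.
if cmp -s /tmp/r1.fq /tmp/r2.fq; then
    check="ERROR: R1 and R2 are identical -- same file passed twice?"
else
    check="OK: R1 and R2 differ"
fi
echo "$check"

# 2) Read IDs (minus any /1 /2 suffix) should pair up line by line.
awk 'NR % 4 == 1 {sub(/\/[12]$/, ""); print}' /tmp/r1.fq > /tmp/ids1
awk 'NR % 4 == 1 {sub(/\/[12]$/, ""); print}' /tmp/r2.fq > /tmp/ids2
mismatches=$(paste /tmp/ids1 /tmp/ids2 | awk '$1 != $2 {n++} END{print n + 0}')
echo "$mismatches mismatched read IDs"
```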
YanChunL commented 5 months ago

> As I guessed, you used the same FASTQ file ../00.data/mockTriploid_r1.fq.gz for both ends of the Hi-C reads, which is incorrect. The correct command should be something like:
>
> bwa mem -5SP -t 32 ../00.data/asm.fa ../00.data/mockTriploid_r1.fq.gz ../00.data/mockTriploid_r2.fq.gz | samblaster | samtools view - -@ 32 -S -h -b -F 3340 -o HiC.bam

Oh, this was indeed my negligence. I am very sorry to have bothered you for so long because of my own oversight. Thank you for your answer; I will try the correct command. Thank you very much!

zengxiaofei commented 5 months ago

> Oh, this was indeed my negligence. I am very sorry to have bothered you for so long because of my own oversight. Thank you for your answer; I will try the correct command. Thank you very much!

No problem. Feel free to ask me if you have any questions.