zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
How does HapHiC process collapsed contigs? #9

Closed Iven-gif closed 7 months ago

Iven-gif commented 9 months ago

Hi, thank you for developing HapHiC. It performs excellently in polyploid scaffolding. I have two questions and hope you can help:

Q1: How does HapHiC process collapsed contigs/unitigs? In your schematic diagram, it appears that collapsed contigs are ultimately assigned to one of the groups. If possible, can HapHiC provide information about the identified collapsed contigs, so that I can handle these collapsed contigs separately?

Q2: If my input reference genome contains two completely identical (homozygous) contigs/unitigs, how will HapHiC allocate them? Will 'filter_bam' filter out the alignment information on these contigs?

I would appreciate your assistance. Thank you so much!

Iven

zengxiaofei commented 9 months ago

Hi Iven,

Q1: How does HapHiC process collapsed contigs/unitigs? In your schematic diagram, it appears that collapsed contigs are ultimately assigned to one of the groups. If possible, can HapHiC provide information about the identified collapsed contigs, so that I can handle these collapsed contigs separately?

HapHiC uses multiple lines of evidence to identify potentially collapsed contigs, including Hi-C link density, HiFi read depth (when the GFA file(s) output by hifiasm are provided), and neighborhood density (rank-sum value). These contigs are discarded during Markov clustering but are rescued in the reassignment step.

Our goal is to minimize the negative impact of collapsed contigs on Markov clustering. Therefore, the filtering process may be somewhat coarse: some contigs that are not collapsed may also be discarded. Additionally, the rank-sum method is bifunctional, so chimeric contigs may also be identified and removed in this step. It is worth noting that very long contigs can adversely affect the clustering process, so the filtering steps, as well as the subsequent Markov clustering, are performed at the "bin" level.

By adding the --verbose parameter when running the HapHiC pipeline, the program will output a full log that provides detailed information about the contigs/bins removed by each method. You can find something like:

# For link density filtering

2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Fragment h3tg000010l_bin16 is removed, density=0
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Fragment h1tg000015l_bin10 is removed, density=0
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Fragment h3tg000010l_bin8 is removed, density=0
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Fragment h2tg000151l is removed, density=0
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Fragment h1tg000015l_bin16 is removed, density=0
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Fragment h2tg000011l_bin3 is removed, density=0

# For read depth filtering

2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [read depth filtering] Fragment h2tg000011l_bin2 is removed, read depth=74
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [read depth filtering] Fragment h2tg000011l_bin23 is removed, read depth=74
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [read depth filtering] Fragment h1tg000015l_bin2 is removed, read depth=75
2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] [read depth filtering] Fragment h1tg000015l_bin1 is removed, read depth=75

# For rank sum filtering

2024-01-22 13:57:29 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Fragment h3tg000089l_bin2 is removed, rank sum=485
2024-01-22 13:57:29 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Fragment h3tg000003l_bin3 is removed, rank sum=487
2024-01-22 13:57:29 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Fragment h3tg000089l_bin3 is removed, rank sum=511
2024-01-22 13:57:29 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Fragment h1tg000046l_bin3 is removed, rank sum=544
2024-01-22 13:57:29 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Fragment h4tg000098l is removed, rank sum=567
2024-01-22 13:57:29 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Fragment h1tg000031l_bin6 is removed, rank sum=576
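
To handle the removed contigs separately, the verbose log can be parsed into a per-method list. The following is a minimal sketch, not part of HapHiC: the regular expression is inferred only from the sample lines above (the log format may differ between versions), and the names `parse_removed_fragments` and `contigs_of` are hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample log lines above (an assumption, not a
# documented format): "[<method> filtering] Fragment <name> is removed, <metric>=<value>"
LINE_RE = re.compile(
    r"\[filter_fragments\] \[(?P<method>[^\]]+) filtering\] "
    r"Fragment (?P<fragment>\S+) is removed, (?P<metric>[^=]+)=(?P<value>\S+)"
)
# Bins appear to be named "<contig>_bin<N>" in the sample log.
BIN_SUFFIX_RE = re.compile(r"_bin\d+$")


def parse_removed_fragments(log_lines):
    """Group removed fragments (contigs or bins) by filtering method."""
    removed = defaultdict(list)
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            removed[match["method"]].append(
                (match["fragment"], float(match["value"]))
            )
    return dict(removed)


def contigs_of(fragments):
    """Collapse bin names to their parent contig names, without duplicates."""
    return sorted({BIN_SUFFIX_RE.sub("", name) for name, _ in fragments})


prefix = "2024-01-22 13:57:28 <HapHiC_cluster.py> [filter_fragments] "
sample_log = [
    prefix + "[link density filtering] Fragment h2tg000151l is removed, density=0",
    prefix + "[read depth filtering] Fragment h2tg000011l_bin2 is removed, read depth=74",
    prefix + "[rank sum filtering] Fragment h3tg000089l_bin2 is removed, rank sum=485",
    prefix + "[rank sum filtering] Fragment h3tg000089l_bin3 is removed, rank sum=511",
]

removed = parse_removed_fragments(sample_log)
for method, fragments in removed.items():
    print(method, contigs_of(fragments))
```

Because chimeric and collapsed fragments share the rank-sum category, a list built this way is a set of candidates rather than a confirmed set of collapsed contigs.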

Q2: If my input reference genome contains two completely identical (homozygous) contigs/unitigs, how will HapHiC allocate them? Will 'filter_bam' filter out the alignment information on these contigs?

Yes, by default, filter_bam will filter out ambiguously mapped reads from these contigs due to their low MAPQ values. Without sufficient Hi-C links, these completely identical contigs will not be anchored to any chromosomes.
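
The effect of MAPQ filtering on identical contigs can be illustrated with a small sketch. This is not the filter_bam implementation: the function name `keep_alignment` and the cutoff `min_mapq=1` are assumptions for illustration. It relies only on the SAM convention that the fifth tab-separated field is MAPQ, and on the common aligner behavior of assigning MAPQ 0 to reads that map equally well to several locations.

```python
def keep_alignment(sam_line, min_mapq=1):
    """Return True if a SAM record passes a MAPQ cutoff.

    min_mapq is a hypothetical threshold chosen for this sketch.
    """
    fields = sam_line.rstrip("\n").split("\t")
    mapq = int(fields[4])  # SAM column 5 is the mapping quality
    return mapq >= min_mapq


# Two toy records: one read mapped uniquely, one read that maps equally well
# to two identical contigs and therefore carries MAPQ 0.
unique_read = "\t".join(
    ["read1", "0", "ctgA", "100", "60", "10M", "*", "0", "0", "ACGTACGTAC", "*"]
)
ambiguous_read = "\t".join(
    ["read2", "0", "ctgB", "100", "0", "10M", "*", "0", "0", "ACGTACGTAC", "*"]
)

kept = [line for line in (unique_read, ambiguous_read) if keep_alignment(line)]
print(len(kept))
```

If every read on a pair of identical contigs is ambiguous, no records survive such a filter, which is why those contigs end up without enough Hi-C links to be anchored.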

Best regards, Xiaofei

Iven-gif commented 9 months ago

Thank you very much for your explanation.

I would like to ask if all the removed contigs mentioned in the above log file also include chimeric contigs.

Additionally, I would like to inquire about the use of '--density_upper 0.9 --rank_sum_upper 0.8' in your paper for filtering collapsed contigs of S. spontaneum Np-X. I understand the explanation of the parameters, but I'm not clear about the rationale behind this specific setting. Could you please provide a brief explanation? This will help me make corresponding adjustments for other genomes in the future.

Best regards! Iven

zengxiaofei commented 9 months ago

Hi Iven,

I would like to ask if all the removed contigs mentioned in the above log file also include chimeric contigs.

Yes, chimeric contigs/bins are also included in the log file under [rank sum filtering]. Chimeric and collapsed contigs/bins are not differentiated in the log file.

Additionally, I would like to inquire about the use of '--density_upper 0.9 --rank_sum_upper 0.8' in your paper for filtering collapsed contigs of S. spontaneum Np-X. I understand the explanation of the parameters, but I'm not clear about the rationale behind this specific setting. Could you please provide a brief explanation? This will help me make corresponding adjustments for other genomes in the future.

Identifying misassembled contigs using Hi-C link density and the rank-sum method is effective, but selecting appropriate thresholds can be challenging. By default, we use 1.9 times the average Hi-C link density and Q3 + 1.5 × IQR of the rank-sum values as the thresholds, which work well in most cases. However, when the assembly contains a large number of misassembled contigs, it may be necessary to set these thresholds manually. This is the case with the genome assembly of S. spontaneum Np-X. --density_upper 0.9 means that the contigs/bins with the top 10% highest link densities are filtered out, while --rank_sum_upper 0.8 removes the contigs/bins with the top 20% highest rank-sum values.
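
The two kinds of threshold described above can be sketched as follows. This is an illustration under stated assumptions, not HapHiC's code: the function names are hypothetical, the quartile method (`statistics.quantiles` with `method="inclusive"`) and the linear interpolation used for the quantile cutoff are choices made for this example, and whether HapHiC compares with `>` or `>=` is not stated in this thread.

```python
import statistics


def default_density_upper(densities, factor=1.9):
    """Multiple-of-mean cutoff: factor times the average link density."""
    return factor * statistics.fmean(densities)


def default_rank_sum_upper(rank_sums):
    """Outlier cutoff Q3 + 1.5 * IQR (quartile method is an assumption)."""
    q1, _median, q3 = statistics.quantiles(rank_sums, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def quantile_threshold(values, fraction):
    """Value at the given quantile, using linear interpolation.

    With fraction=0.9, roughly the top 10% of values lie above the result.
    """
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def flag_above(values_by_name, threshold):
    """Names whose value exceeds the threshold (strict comparison assumed)."""
    return sorted(name for name, value in values_by_name.items() if value > threshold)


# Toy rank-sum values for ten hypothetical bins.
rank_sums = {f"bin{i}": float(i) for i in range(1, 11)}

default_cutoff = default_rank_sum_upper(rank_sums.values())
manual_cutoff = quantile_threshold(rank_sums.values(), 0.8)

print(flag_above(rank_sums, default_cutoff))  # outlier-based default
print(flag_above(rank_sums, manual_cutoff))   # quantile-based, top 20%
```

In this toy data no value is an outlier, so the default cutoff removes nothing, while the quantile cutoff always removes a fixed share. That difference is the reason a quantile setting can help when misassembled contigs are too numerous to stand out as outliers.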

Best regards, Xiaofei