zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
142 stars 10 forks source link

The signal on one end of pseudochromosomes is missing #77

Closed tinyfallen closed 1 month ago

tinyfallen commented 1 month ago

Dear developers,

Thanks for your great software!

I am working on assembling a putative hexaploid or octoploid plant genome with a genome size of ~1.45 Gb by flow cytometry. In addition, the only one published closely related species has 2n=2X=28 karyotype, 400 Mb haploid genome size, and the CCDB shows the chromosome base number of this genus may be 9 or 14.

I obtained 54 pseudochromosomes by scaffolding the primary assembly generated by hifiasm with ~41x hifi reads and ~140x Hi-C reads. The colinear plot between this genome and the relative species indicated it may have undergone chromosome fusion events. I have several questions doing scaffolding.

(1) Can I determine how many the chromosome sets are by Hi-C analysis of haphic? (2) Is there any possibility of a complex chromosome number variations occurring? like ((14 x 2) - 1 ) x 2 (3) The signal on one end of a considerable number of pseudochromosomes is missing, should I break the contigs and move this regions to debris? image image

(4) Some pseudochromosomes centromeric regions or ends are just consist of debris which have nearly no signal, should I remove them? image image

tinyfallen commented 1 month ago

According to the interaction signal among pseudochromosomes, I think it should be an octoploid. Typically, a pseudochromosome has one strongly and two weakly interacting pseudochromosomes. image

zengxiaofei commented 1 month ago

(1) Can I determine how many the chromosome sets are by Hi-C analysis of haphic?

This information can be inferred from 1) self-alignment, 2) alignment to other species, and (3) diagonally distributed Hi-C signals on contact maps.

(2) Is there any possibility of a complex chromosome number variations occurring? like ((14 x 2) - 1 ) x 2

For bioinformatics analysis, drawing such a conclusion requires more information and experience. For example: a full version of contact map without clustering errors, the locations of telomeres and centromeres on each chromosome, alignments to closely related species using dot plots, read depth distribution across each chromosome, etc.

(3) The signal on one end of a considerable number of pseudochromosomes is missing, should I break the contigs and move this regions to debris? (4) Some pseudochromosomes centromeric regions or ends are just consist of debris which have nearly no signal, should I remove them?

Regions lacking Hi-C signals are typically either repetitive sequences or regions that are extremely similar among different haplotypes. During BAM filtering, these signals are filtered out by the MAPQ ≥ 1 criterion due to multiple alignments. The handling of these sequences should be considered case by case. If assemblers join these regions with signal-rich regions into the same contig, I generally regard the results as reliable, as the assemblers use additional information from overlapping HiFi reads. Conversely, the placements of independently existing signal-lacking contigs are considered unreliable if their placements rely solely on weak Hi-C signals. The placements of these sequences can be manually determined based on the GFA files generated by assemblers using bandage. If their placements remain uncertain, they can be moved to debris.

More information: https://github.com/zengxiaofei/HapHiC/issues/2#issuecomment-1871681093

According to the interaction signal among pseudochromosomes, I think it should be an octoploid. Typically, a pseudochromosome has one strongly and two weakly interacting pseudochromosomes.

Before providing further comments, I need to know how this genome was assembled, including the version of the assembly used (p_ctg, p_utg, hap*.p_ctg).

tinyfallen commented 1 month ago

已收到您发送的邮件!

tinyfallen commented 1 month ago

Thank you very much for your suggestion! I also think keeping the original assembling results should be more reliable instead of breaking up contigs. And for the Hi-C link near-absent small contigs, I will check their read mapping depth.

We vastly underestimated the size of this genome before sequencing. I think the sequencing depth (~43x) was not enough for accurately separate haplotypes, let alone it may be a autopolyploid, even if we have enough Hi-C reads. So I chose the p_ctg as input for scaffolding a haplotype collapsed genome. Do you think haplotyping is better?

As for this plot, I noticed haphic uses MAPQ ≥ 1 as the default filtering settings. I am not sure this may erase the inter-pseudochromosomes contact signals or not, so I kept the reads whose MAPQ = 0 to draw this plot for a test.

According to the interaction signal among pseudochromosomes, I think it should be an octoploid. Typically, a pseudochromosome has one strongly and two weakly interacting pseudochromosomes. image

zengxiaofei commented 1 month ago

So I chose the p_ctg as input for scaffolding a haplotype collapsed genome.

The p_ctg version is not correctly phased. The contigs in it may contain randomly joined long phased blocks. We typically use hap*.p_ctg for diploids and p_utg for autopolyploids. For more information, please refer to https://lh3.github.io/2021/04/17/concepts-in-phased-assemblies and https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#interpreting-output.

As for this plot, I noticed haphic uses MAPQ ≥ 1 as the default filtering settings. I am not sure this may erase the inter-pseudochromosomes contact signals or not, so I kept the reads whose MAPQ = 0 to draw this plot for a test.

The MAPQ ≥ 1 criterion does not specifically filter out inter-chromosome Hi-C signals. Instead, it mainly removes multiple alignments. The diagonally distributed Hi-C links between homologous chromosomes stem from mapping errors and assembly errors.

tinyfallen commented 1 month ago

Thank you for you advice!

I have tried unitigs for scaffolding, but the utg N50 (1.2Mb) is much smaller than ctg N50 (21Mb), resulting much more fragmented pseudochromosomes and making it more complicated to scaffold. Besides, the final size of pseudochromosomes is only a few Mb larger than ctg version. So we decide using ctg for scaffolding.

As for Hi-C signal filtering, I am curious about that if the sequences of two or multiple chromosomal regions are near the same, will this MAPQ ≥ 1 filtering criterion cause all the signals in these regions to disappear?

zengxiaofei commented 1 month ago

I have tried unitigs for scaffolding, but the utg N50 (1.2Mb) is much smaller than ctg N50 (21Mb), resulting much more fragmented pseudochromosomes and making it more complicated to scaffold. Besides, the final size of pseudochromosomes is only a few Mb larger than ctg version. So we decide using ctg for scaffolding.

You are free to persist in how you conduct your research and take responsibility for it, and the heatmaps you provided do not appear to have obvious anomalies, however, considering that GitHub Issues are also visible to other users, as the author of HapHiC and the maintainer of this GitHub repository, I still need to express my personal opinion and stance: I think these reasons are not sufficient to convince me to choose a conceptually incorrect assembly version. A lower contiguity is preferable to being incorrect.

As for Hi-C signal filtering, I am curious about that if the sequences of two or multiple chromosomal regions are near the same, will this MAPQ ≥ 1 filtering criterion cause all the signals in these regions to disappear?

Although MAPQ is defined as −10 log10 Pr{mapping position is wrong} in the Sequence Alignment/Map Format Specification, the calculation of MAPQ varies among different read mappers. I can only say that the MAPQ ≥ 1 criterion is the minimum requirement to prevent multiple mappings. This threshold is also used by several Hi-C-related software tools, including 3D-DNA and chromap.

Here is a comparison of various MAPQ cutoffs in the genome assembly of the autotetraploid potato C88:

image

tinyfallen commented 1 month ago

A lower contiguity is preferable to being incorrect.

I think this statement makes sense, however, there are about 2500 unitigs which have been assigned to pseudochromosomes by haphic. The complexity of scaffolding these thousands of unitigs by only Hi-C evidence raises the concern that I can not determine whether unitig scaffolding is more correct than the potential chimeric contigs generated by assembler or not. Do you have any suggestions?

In addition, the Hi-C signals among contigs generally looks good, like the heatmaps I posted. The author of hifiasm suggested usually >=13x HiFi reads per haplotype are desired. Higher coverage might be able to improve the contiguity of assembly. We may not accurately phase eight sets of haplotypes by ~40x reads, a haplotype-collapsed genome is acceptable. Maybe I should ask my boss for more sequencing depth and other scaffolding evidences ~

By the way, I am confused about the "primary assembly" described here (Are polyploid genomes supported), whether it is p_ctg or p_utg? I used --n-hap 8 in running hifiasm.

Here is a comparison of various MAPQ cutoffs in the genome assembly of the autotetraploid potato C88:

Thanks for your example, it seems more obvious than my case. In my opinion, we should make use of the multiple mapping reads instead of filtering them along with the low mapping quality reads, I am not sure it is correct or not, do you think multiple mapping reads should be treated as the same as low mapping quality reads?

tinyfallen commented 1 month ago

Thanks to your suggestion, I found some assembly error in contigs by checking the hifi reads mapping depth on the pseudochromosomes where the depth changes suddenly and Hi-C signals are absent. There may be more such errors hiding in the contigs.

At present, I can only judge by feeling that homologous collapse and heterologous chimerism may occur together in some contigs. And I face the issue how to find out them and how to revise them. Do you think it makes sense using contigs to assist scaffolding those unitigs or using corresponding unitigs to replace those flawed contigs?

Hope these ignorant questions won't take up too much of your time, thank you very much!

zengxiaofei commented 1 month ago

I think this statement makes sense, however, there are about 2500 unitigs which have been assigned to pseudochromosomes by haphic. The complexity of scaffolding these thousands of unitigs by only Hi-C evidence raises the concern that I can not determine whether unitig scaffolding is more correct than the potential chimeric contigs generated by assembler or not. Do you have any suggestions?

They are different types of errors. Short contigs (often < 100 kb) do have a higher potential to be erroneously ordered and oriented. These kinds of errors can be alleviated by using the assembly graph output by the assemblers. However, the primary assembly (p_ctg) can introduce phasing errors (see the top right panel in the figure below).

image

We may not accurately phase eight sets of haplotypes by ~40x reads, a haplotype-collapsed genome is acceptable.

First, the concepts of the primary assembly and the haplotype-collapsed assembly are different. Furthermore, in your case, the primary assembly contains four times the haploid genome size due to high heterozygosity.

By the way, I am confused about the "primary assembly" described here (Are polyploid genomes supported), whether it is p_ctg or p_utg?

The "primary assembly" is p_ctg, while the sequences in p_utg are unitigs. p_ctg may not be correctly phased. For diploids, when you have provided trio data or Hi-C data, the generated hap*.p_ctg files are often correctly phased. If you did not provide trio data or Hi-C data, adding the --n-hap parameter puts hifiasm in "dual" mode, and the generated hap*.p_ctg files are not correctly phased (see the top middle panel in the figure above). However, for autopolyploids, hifiasm may generate imbalanced haplotypes (hap*.p_ctg). Therefore, researchers often choose unitigs (p_utg) for scaffolding, which are typically correctly phased but not assigned to haplotypes. This is why the authors of hifiasm drew the conclusion in their documentation:

The r_utg.gfa and p_utg.gfa are lossless so that they also work for polyploid genomes.

zengxiaofei commented 1 month ago

Close this issue as there has been no response for two weeks.

tinyfallen commented 1 month ago

Sorry for the late reply, I went on a meeting trip recently. Thank you very much for your patient answers, they are very helpful for me!