zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

Scaffolding of Micro-chromosomes #72

Closed hrluo93 closed 2 months ago

hrluo93 commented 2 months ago

Hi,

Thanks for developing such wonderful tools.

I am working on birds with many, many micro-chromosomes. I have some questions. I am looking forward to hearing from you.

For the input contig assembly from NextDenovo, I tried both HiC.bam and HiC.filtered.bam.

  1. The one-step pipeline reported that "some chromosomes were grouped together, or the maximum number of clusters(1) is even less than the expected number of chromosomes(31)". I then increased --max_inflation to 30 with step 0.3 and ran the pipeline step by step, but clustering still failed. Should I increase --max_inflation further?
  2. I used the quick view mode and found many small contigs with no Hi-C contact signal. I then sorted HiC.bam by name and used it as input to YaHS; these small contigs showed increased Hi-C contact signals. Does this mean YaHS may be more suitable for birds?
  3. I also noticed that YaHS changed the contig name Ctg000001 to Ctg000001.1. Is Ctg000001.1 the same as Ctg000001 when I set "--no-contig-ec --no-scaffold-ec"?
  4. Finally, can I use the porec_paired.bam converted by the HapHiC preparation Python script as input to YaHS, if YaHS is more suitable for birds?

Best wishes! Haoran

zengxiaofei commented 2 months ago
  1. The one-step pipeline reported that "some chromosomes were grouped together, or the maximum number of clusters(1) is even less than the expected number of chromosomes(31)". I then increased --max_inflation to 30 with step 0.3 and ran the pipeline step by step, but clustering still failed. Should I increase --max_inflation further?
  2. I used the quick view mode and found many small contigs with no Hi-C contact signal. I then sorted HiC.bam by name and used it as input to YaHS; these small contigs showed increased Hi-C contact signals. Does this mean YaHS may be more suitable for birds?

Without contact maps and enough information, I cannot comment on the issues you’ve encountered or your observations. In addition, Hi-C contact signals are fully represented in the BAM file. Unless different filtering criteria are applied to the two tools, it is impossible to claim that “these small contigs showed increased Hi-C contact signals.” Furthermore, the BAM file requirements for HapHiC are consistent with those for YaHS. Therefore, if you generate the BAM file as I recommend, there is no need to sort it by name again.

  3. I also noticed that YaHS changed the contig name Ctg000001 to Ctg000001.1. Is Ctg000001.1 the same as Ctg000001 when I set "--no-contig-ec --no-scaffold-ec"?

The correspondence of contig ID is saved in *.liftover.agp.
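As a rough sketch of how such a file could be inspected, the snippet below reads AGP lines and records, for each component ID, where it was placed in the output objects. It assumes the file follows the standard tab-separated AGP column layout (object, object start, object end, part number, component type, component ID, component start, component end, orientation). The function name and the sample IDs used in any test data are invented for illustration; which column carries the renamed contig ID should be checked against a few lines of the actual file.

```python
def component_to_object(agp_lines):
    """Map each AGP component ID to the object spans it occupies.

    Assumes standard AGP columns; gap lines (component type N or U)
    and comment lines are skipped.
    """
    mapping = {}
    for line in agp_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[4] in ("N", "U"):
            continue  # skip malformed lines and gap records
        obj, obj_beg, obj_end = fields[0], int(fields[1]), int(fields[2])
        comp, comp_beg, comp_end = fields[5], int(fields[6]), int(fields[7])
        orient = fields[8]
        mapping.setdefault(comp, []).append(
            (obj, obj_beg, obj_end, comp_beg, comp_end, orient)
        )
    return mapping
```

Calling it on an open file handle, for example `component_to_object(open(path))` with a hypothetical `path`, returns a dictionary keyed by component ID, so looking up one contig shows every span it maps to.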

  4. Finally, can I use the porec_paired.bam converted by the HapHiC preparation Python script as input to YaHS, if YaHS is more suitable for birds?

Yes. porec_paired.bam meets the format requirements of YaHS for the input file.

hrluo93 commented 2 months ago

Many thanks for your reply!

The karyotype of the species is 2n=60+ZZ.

The commands and parameters used in running HapHiC:

    /media/perimeter/r2/srcs/HapHiC/haphic cluster /media/perimeter/r2/eeu/eeuontasm/03.ctg_graph/nd.asm.w60.fasta /media/perimeter/r2/eeu/eeuontasm/03.ctg_graph/HiC.filtered.bam 31 --max_inflation 30 --inflation_step 0.3

The log file generated by HapHiC: HapHiC_cluster.log

The methods (commands) used for Hi-C read mapping and filtering: as recommended by HapHiC.

The method used for genome assembly: ONT ultra-long reads assembled with NextDenovo, without polishing.

Statistics for the assembly input into HapHiC:

Type     Length (bp)    Count (#)
N10      220407187      1
N20      168975251      2
N30      168975251      2
N40      127899466      3
N50      84066100       5
N60      70663523       6
N70      48140589       9
N80      24783511       12
N90      18726849       18

Min.     72153          -
Max.     220407187      -
Ave.     18954379       -
Total    1232034693     65

Best wishes! Haoran

zengxiaofei commented 2 months ago

Although I still have not seen your contact maps, it doesn't matter. If you have carefully read our paper or listened to our presentation on bilibili, you should know that HapHiC and YaHS are two different types of scaffolders. HapHiC usually requires a priori knowledge of the number of chromosomes and prefers a relatively uniform distribution of chromosome lengths, while YaHS does not. As a result of these trade-offs, HapHiC is often more likely to achieve chromosome-level scaffolding with a higher anchoring rate, while YaHS may have more difficulty doing so. However, YaHS tends to be more stable when the number of chromosomes is unknown, or when there are significant differences in chromosome lengths.

In your case, I noticed that your L90 is only 18, which indicates that the setting of 31 includes microchromosomes. Because microchromosomes are much shorter than the other chromosomes, HapHiC usually does not recognize them as proper chromosomes, so the number of clusters never reaches 31. To address this, it would be better to set nchrs to the chromosome number minus the microchromosomes. Identifying and scaffolding the microchromosomes then requires manual adjustment. Alternatively, you can start manual adjustment directly from the quick view results.
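For reference, the L90 mentioned here is the Count value in the N90 row of the statistics above: the smallest number of contigs, taken from longest to shortest, whose combined length reaches 90% of the assembly. A minimal sketch of that calculation follows; the function name `nx_lx` is made up for illustration and is not part of HapHiC or NextDenovo.

```python
def nx_lx(lengths, fraction):
    """Return (Nx, Lx) for a fraction such as 0.9 (N90/L90).

    Nx is the length of the contig at which the cumulative length,
    counted from the longest contig down, first reaches the given
    fraction of the total. Lx is how many contigs that took.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")
```

A low L90 relative to the expected chromosome number, as in this thread, suggests that most of the assembly sits in a few long sequences and that the remaining chromosomes contribute very little length.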

Regarding your statement that YaHS exhibits stronger Hi-C contact signals on short contigs compared to HapHiC, it’s important to note that the Hi-C contact signals on contact maps are independent of these scaffolding tools. Instead, they depend solely on your alignment and filtering methods. This observation might be due to other errors in your process.

Overall, if you believe that YaHS produces better results, please feel free to continue using it. However, if you prefer to use HapHiC, I would recommend reducing the number of chromosomes (nchrs) to the number of main chromosomes or just using the quick view mode. Since your contig length is already long enough, different scaffolders will not show significant differences.

hrluo93 commented 2 months ago

"This observation might be due to other errors in your process." Yes, I have checked the raw files. I used the filtered BAM for HapHiC and the unfiltered BAM for YaHS, and so made an incorrect statement. Sorry for my carelessness. HapHiC is excellent software that is easy to use and handles haplotype assembly. I will try reducing the "nchrs" setting.

Many thanks again!