There are four types of errors that can result in the strong off-diagonal signals:
(1) Chromosome misassignment
Some contigs may be assigned to the wrong chromosomes during genome scaffolding. These contigs exhibit very weak Hi-C signals with their current chromosomes but show much stronger signals with other chromosomes. They should be moved to the correct chromosomes in Juicebox.
(2) Chimeric contigs
Regions that are far apart or from different chromosomes can be misjoined into a single contig during genome assembly. Although the parameter --correct_nrounds 2 was used to correct these contigs, some of them may remain uncorrected. They should be split into separate parts and moved to the correct chromosomes in Juicebox.
(3) Collapsed contigs
Regions that are far apart or from different chromosomes (often allelic regions from homologous chromosomes) can be merged into a single consensus contig during genome assembly due to very low heterozygosity levels or even identical sequences. Collapsed contigs are more difficult to correct than chimeric contigs. You may need to create another copy for the allelic regions.
(4) Switch errors and mapping errors
These errors are introduced during genome assembly and Hi-C read mapping. They can result in diagonally distributed Hi-C signals between homologous chromosomes. In your case, these errors are observable but not severe, so no action is required.
For more information, please refer to the definitions in our paper.
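As a rough illustration of the misassignment criterion in (1), here is a minimal Python sketch. It is not how HapHiC or Juicebox works internally. The function names, the `link_density` dictionary (Hi-C links between one contig and each chromosome, normalized however you prefer), and the `min_ratio` threshold are all hypothetical choices made for this example.

```python
def best_chromosome(link_density):
    """Return the chromosome with the highest Hi-C link density to a contig."""
    return max(link_density, key=link_density.get)


def is_misassigned(assigned, link_density, min_ratio=2.0):
    """Flag a contig whose strongest signal is with a different chromosome.

    assigned     -- chromosome the contig currently belongs to
    link_density -- dict: chromosome name -> normalized link density (assumed input)
    min_ratio    -- hypothetical threshold: how much stronger the best
                    chromosome must be than the assigned one
    """
    best = best_chromosome(link_density)
    if best == assigned:
        return False
    own = link_density.get(assigned, 0.0)
    # No signal at all with the assigned chromosome counts as misassigned.
    return own == 0.0 or link_density[best] / own >= min_ratio


# Made-up densities for one contig currently placed on chr1.
densities = {"chr1": 0.2, "chr2": 1.5}
print(is_misassigned("chr1", densities))  # contig links far more strongly to chr2
```

A check like this only suggests candidates. The contact map in Juicebox is still needed to confirm the move.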
Although the process of generating .mnd, .assembly, and .hic files could be annoying, the usage and the underlying principle of the Juicebox Assembly Tool GUI are straightforward for me.
In your case, some errors may stem from the genome assembly process. Hence, these errors cannot be solved solely by adjusting parameters in HapHiC. I cannot identify all these assembly errors because the contact map generated by haphic plot does not include the boundaries of contigs. However, based on the contact map, your assembly appears to be generally good. When using Juicebox, I believe an experienced researcher could address all these issues within 30 minutes.
You are 100% correct, but it's the idea that it's OK to move things while knowing in advance what you want that I don't like. The aim of bioinformatics should be to automate processes, no? I would say that if the errors stem from the assembly, the correct approach is to correct the assembly. I accept it's a philosophical discussion and our views might be impossible to reconcile.
I'm sorry if I have offended you; that was my fault. As I am not a native English speaker and communication is not my strong suit, I may have unintentionally said something that could be easily misunderstood or seen as impolite. I did not doubt the professionalism of you and your team. What I meant here is that a skilled Juicebox user can quickly solve these issues, so you don't need to worry about wasting too much time. If you believe that the professionalism of an individual or team is important for understanding the current problem and finding a solution, you also don't need to worry because the Juicebox team (Aiden Lab) is very professional. Professor Aiden is one of the inventors of the Hi-C technology, and the Juicebox toolkits are widely used in this area. Another point I need to clarify to avoid misunderstandings is that I am not forcing you to use Juicebox; it is your choice how to handle your genome. I am not a fan of Juicebox, and I have no affiliation with the Juicebox team. I sincerely apologize for any confusion I may have caused.
It appears that you believe bioinformatics software should aim to automate all processes. Yes, this is definitely everyone's ultimate goal. However, there are still many difficulties at the current stage. Although different tools perform differently, I have not yet seen any tool that can consistently achieve this across various cases. It can even be said that if the scaffolding problem can be perfectly solved, as long as it is combined with the assembly graph, any genome can be directly assembled to a level close to T2T, which would be big news. The current automated scaffolding tools are designed to operate in a way that aligns with human understanding, and they usually do not actually outperform humans (although they save more time). Furthermore, errors in genomes of different types actually have obvious differences on contact maps (there are even scaffolding tools based on graphics and machine learning like AutoHiC). So, I do not think there are any issues with Juicebox. Moreover, Juicebox itself is also a bioinformatics tool, isn't it?
Please feel free to choose the tools that you think are appropriate. I have extensively used and tested most of the software examples you mentioned, and I have a deep understanding of their respective features. Each software has its unique and commendable aspects in terms of innovation and algorithms.
Regarding your point that these signals are caused entirely by HapHiC rather than by errors in the genome itself, I'm afraid I cannot agree. I have pointed out some issues in the figure below, where 'a' and 'b' are two obviously collapsed regions, meaning that the sequences of the two haplotypes were assembled into only one haplotype. To ensure the completeness of the genome, these regions need to be copied. You can verify this through HiFi read depth, as the depth of these regions will be twice as high as in normal regions. As far as I know, the scaffolding tools currently available do not have the functionality to address this issue. Even if they automatically copied this region for you, it would create new difficulties during scaffolding due to multiple alignments. If you do not see this signal in the contact maps from other software, it is very likely that this region has been moved to unanchored contigs.
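The read-depth check described above could be sketched as follows. This is only an illustration: it assumes you have already computed mean HiFi depth per fixed-size window along a contig (for example with a depth tool of your choice), and the function name, `expected_fold`, and `tolerance` are hypothetical parameters, not part of HapHiC.

```python
from statistics import median


def flag_collapsed_windows(depths, expected_fold=2.0, tolerance=0.25):
    """Return indices of windows whose depth is close to expected_fold x the median.

    depths        -- per-window mean read depth (assumed precomputed)
    expected_fold -- 2.0 for two haplotypes collapsed into one sequence
    tolerance     -- hypothetical relative margin around the expected depth
    """
    baseline = median(depths)
    if baseline <= 0:
        return []
    low = baseline * expected_fold * (1 - tolerance)
    high = baseline * expected_fold * (1 + tolerance)
    return [i for i, depth in enumerate(depths) if low <= depth <= high]


# Made-up depths: windows 3 and 4 sit at roughly twice the typical depth.
window_depths = [30, 31, 29, 62, 60, 30, 28]
print(flag_collapsed_windows(window_depths))  # [3, 4]
```

The median is used as the baseline so that a few collapsed or repetitive windows do not inflate the "normal" depth estimate.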
The blank signals (c) are usually observed in regions that are highly similar between homologous chromosomes or rich in repetitive sequences. Hi-C reads from these regions map to multiple locations and are filtered out by the criterion MAPQ>=1. If you want to visualize these signals, you can choose not to filter by MAPQ during the plotting stage, but this may introduce a large proportion of alignment errors. As mentioned earlier, do not use unfiltered BAM files at the scaffolding stage, as this will cause more problems.
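To make the MAPQ>=1 criterion concrete, here is a small Python sketch that applies it to SAM-formatted text lines (MAPQ is the fifth tab-separated column in the SAM format). The function name is hypothetical and this is not the filtering code used by HapHiC. In practice the same filter is usually applied with samtools view -q 1 on the BAM file.

```python
def filter_sam_by_mapq(lines, min_mapq=1):
    """Yield SAM header lines plus alignments with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ; always keep them.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            yield line


# Made-up records: r2 has MAPQ 0 (typical for multi-mapped reads) and is dropped.
sam_lines = [
    "@SQ\tSN:ctg1\tLN:1000",
    "r1\t0\tctg1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tctg1\t200\t0\t50M\t*\t0\t0\t*\t*",
]
kept = list(filter_sam_by_mapq(sam_lines))
print(len(kept))  # 2
```

Setting min_mapq=0 keeps every alignment, which corresponds to plotting without the MAPQ filter.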
There are also some regions that I have not labeled, and I cannot determine whether they are misassignments (errors introduced by HapHiC) or chimeric contigs (errors introduced by the assembler) because of the lack of contig boundary information. If a misjoin occurs between two contigs, it is the former; otherwise, it is the latter. We have compared the effectiveness of the misjoin correction functions of widely used scaffolding tools in our article. Other tools may be able to detect and correct more chimeric contigs than HapHiC (see figure below), but they are also more likely to break correctly assembled contigs. This is a trade-off. In HapHiC, we have designed a stringent misjoin correction because current HiFi data and assemblers usually do not introduce too many misjoins. We aim to preserve the contiguity of contigs as much as possible and let users correct the remaining misjoins in Juicebox.
I will ask my PI what we can share with you. If you have time, and to prove my good faith, we could play with different tools and discover what's happening. I don't like Juicebox simply because there are no guidelines and anyone could in theory use it according to their liking. I believe it's a dangerous path that could lead to problems if you are looking for a specific "assembly".
Otherwise, never apologize for your English, don't worry. I was irritated by what I interpreted as a careless attitude, but I see you actually care a lot. It's my mistake.
I have to say I don't believe in the assembly error, but without seeing the data you can't decide for yourself. I will pass the word to my PI. I don't want to lose time in a GitHub fight. If it's OK with you, I will see what we can share. Also, sometimes a tool doesn't work with a specific data set for totally obscure reasons.
Take care, and let me check with the people I consulted. All this might be a misunderstanding. I also need to think about your observations. I promise I will come back to you.
There is no wish to fight. After reflection, it's an unnecessary, painful loss of time. Take care.
Edit: I deleted the messages
Hello, here is my matrix. It's on a diploid assembly, but I am bothered by the very strong off-diagonal signal: contact_map.pdf
Here was the command:
and