Closed xiekunwhy closed 6 months ago
Hi Kun,
Thanks for your feedback! Could you please show me your command lines and paste the log files output by HapHiC? This information will be of great help.
Best regards, Xiaofei
Hi, Kun
There's nothing wrong with your commands.
You could try using the parameter --skip_allhic
in haphic sort
. It will help me figure out whether the problem occurred during ALLHiC optimization. I have observed that in rare cases, optimization may introduce large-scale ordering and orientation errors.
Furthermore, could you please visualize your scaffolding result in Juicebox? I need to check whether there are misjoins in the contigs of the assembly.
Best regards, Xiaofei
Hi, Xiaofei
the perfomance is much better than default when adding --skip_allhic parameter, here is haphic sort --skip_allhic results heatmap (group5 is butterfly miss-connected?),
the errors between group1 and group4 is caused by contig errors (ptg000002l and ptg000030l), here is contig heatmap, preparing Juicebox is time consuming, so I just plot contig heatmap here (>1Mb contig was kept).
Best, Kun
Here is the agp file (haphic sort --skip_allhic) scaffolds.agp.zip
Hi Kun,
I'm glad to see that your result has improved significantly. I also have a plan to fix the issue raised by ALLHiC.
In group5, there could be a misconnection between telomeres. This problem may also be improved in subsequent updates.
To correct the misjoins in contigs, you can try using the parameter --correct_nrounds 2
in haphic cluster
, or manually curating the result in Juicebox. The assembly correction algorithm implemented in HapHiC is more stringent than other scaffolding tools, so it should not significantly affect the contig contiguity of your assembly.
Best regards, Xiaofei
Hi, Xiaofei
Glad to hear that and hope new version comming soon.
Thank you for your suggestions, I will try it.
Best, Kun
Hi, Xiaofei
Just a suggestion.
Would you please don't change contig names in final agp file when running pipeline with --correct_nrounds (>0)
options, use original coordinates is more convenience to other to lift things (for example, gene annotation gff) from contig coordinates to the final scaffold coordinates.
HapHiC use *_breakn as new contig name, and the coordinates of all break pieces are re-counting from 1, users need to write a script to covert names and coordinates back, some one can not write such script well, I think.
YaHS keep original contig names and coordinates in final results, if I remember correctly.
Best, Kun
Hi Kun,
Thanks for your suggestion!
I will improve it in the subsequent updates.
Best regards, Xiaofei
Hi, Xiaofei
Just a suggestion.
Would you please don't change contig names in final agp file when running pipeline with
--correct_nrounds (>0)
options, use original coordinates is more convenience to other to lift things (for example, gene annotation gff) from contig coordinates to the final scaffold coordinates.HapHiC use *_breakn as new contig name, and the coordinates of all break pieces are re-counting from 1, users need to write a script to covert names and coordinates back, some one can not write such script well, I think.
YaHS keep original contig names and coordinates in final results, if I remember correctly.
Best, Kun
Hi Kun,
I have addressed this issue in the latest version of HapHiC (version 1.0.1). For more details, please read the README.md
.
I will start to handle the problem regarding the contig ordering and orientation. But it still needs some time for me.
Best regards, Xiaofei
Hi Xiaofei,
Thank you very much, I will try it.
Best, Kun
In the latest commit of HapHiC, I have fixed a bug which can lead to a Segmentation fault in running juicebox.sh
.
Xiaofei
Hi,
sklearn in https://github.com/zengxiaofei/HapHiC/blob/main/conda_env/create_conda_env_py310.sh means scikit-learn, right? I got error log when I want to re-install a newer version. The errors like following
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple/ sklearn
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple/
Collecting sklearn
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/46/1c/395a83ee7b2d2ad7a05b453872053d41449564477c81dc356f720b16eac4/sklearn-0.0.post12.tar.gz (2.6 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
rather than 'sklearn' for pip commands.
Here is how to fix this error in the main use cases:
- use 'pip install scikit-learn' rather than 'pip install sklearn'
- replace 'sklearn' by 'scikit-learn' in your pip requirements files
(requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
- if the 'sklearn' package is used by one of your dependencies,
it would be great if you take some time to track which package uses
'sklearn' instead of 'scikit-learn' and report it to their issue tracker
- as a last resort, set the environment variable
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
More information is available at
https://github.com/scikit-learn/sklearn-pypi-package
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Best, Kun
Yes, to install a newer version of sklearn, you should pip3 install scikit-learn
, just like what we did in https://github.com/zengxiaofei/HapHiC/blob/main/conda_env/create_conda_env_py311.sh:
conda create -n haphic_py311 python=3.11.2
conda activate haphic_py311
pip3 install numpy scipy matplotlib
pip3 install scikit-learn networkx pysam
conda install -c intel mkl
conda config --add channels conda-forge
conda install sparse_dot_mkl
pip3 install portion
conda env export > environment_py311.yml
Best regards, Xiaofei
Hi,
Just a bug, haphic pipeline with --correct_nrounds > 0, if there is no contig need to be corrected, corrected_ctgs.txt will not be written, but 04.build step still reading corrected_ctgs.txt file.
2023-12-15 18:53:51 <HapHiC_sort.py> [fast_sort] [group7_48742957bp] Checking the content of input group file...
2023-12-15 18:53:51 <HapHiC_sort.py> [fast_sort] [group7_48742957bp] Starting fast sorting iterations...
2023-12-15 18:53:51 <HapHiC_sort.py> [run] Program finished in 8.638415098190308s
2023-12-15 18:53:51 <HapHiC_pipeline.py> [haphic_build] Step4: Build final scaffolds (pseudomolecules)...
2023-12-15 18:53:51 <HapHiC_build.py> [run] Program started, HapHiC version: 1.0.2 (update: 2023.12.15)
2023-12-15 18:53:51 <HapHiC_build.py> [run] Python version: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
2023-12-15 18:53:51 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2023-12-15 18:54:03 <HapHiC_build.py> [parse_tours] Parsing tour files...
Traceback (most recent call last):
File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py", line 483, in <module>
main()
File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py", line 476, in main
haphic_build(args)
File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py", line 437, in haphic_build
HapHiC_build.run(args, log_file=LOG_FILE)
File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_build.py", line 267, in run
corrected_ctg_set = parse_corrected_ctgs(args.corrected_ctgs)
File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_build.py", line 65, in parse_corrected_ctgs
with open(corrected_ctgs) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/state/partition1/WORK/Bio/Project/xiek/pipe/hic/GDDB22060177/haphic_ecc/01.cluster/corrected_ctgs.txt'
Traceback (most recent call last):
File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/haphic", line 96, in <module>
subprocess.run(commands, check=True)
File "/Bio/User/kxie/software/miniforge3/envs/haphic/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py', 'Vaccinium_bracteatum.contigs.fa', 'Vaccinium_bracteatum.dedup.bam', '12', '--correct_nrounds', '2', '--RE', 'GATC', '--skip_allhic']' returned non-zero exit status 1.
Best, Kun
Hi Kun,
Thanks! I have fixed this bug in the latest commit.
Best regards, Xiaofei
Hello xiao fei and kun, May I ask a question? I scaffolded my genome according to the method on GitHub,but it seems that there are some problems. Many areas with low HIC interaction signals are put together. What is the reason for this?
@ZJ-1992 is this the global heatmap? Would you please show the global (whole genome) heatmap?
@ZJ-1992 is this the global heatmap? Would you please show the global (whole genome) heatmap? it is a segment of the chromosome 1, I set "--correct_nrounds 2", but No assembly errors found. There is the global heatmap. And I want to ask you how to convert results from hicpro to bam format?
Hi @ZJ-1992,
The anchoring rate is controlled by several parameters including --min_RE_sites
, --min_links
, --min_link_density
, and --min_density_ratio
. Unfortunately, determining the thresholds of these parameters is still challenging until the heatmap is visualized. In your case, the contigs showing low Hi-C signals were assigned to a group based on your parameters. During contig ordering, these contigs shifted towards the ends of the group. If you believe that there are an insufficient number of Hi-C links to accurately position these contigs, you could move them to "debris" in Juicebox.
Best regards, Xiaofei
Hello,
xiaofei, I am sorry for my late reply. I can't determine if those very small fragments are correctly placed, so I can only put them at the end as a debris. Thank you for your suggestion!
***@***.***
---- Replied Message ----
From
Xiaofei ***@***.***>
Date
12/29/2023 10:22
To
***@***.***>
Cc
***@***.***>
,
***@***.***>
Subject
Re: [zengxiaofei/HapHiC] Perform not so good for haplotype mixed assembly (Issue #2)
Hi @ZJ-1992, The anchoring rate is controlled by several parameters including --min_RE_sites, --min_links, --min_link_density, --min_density_ratio. Unfortunately, determining the thresholds of these parameters is still challenging until the heatmap is visualized. In your case, the contigs showing low Hi-C signals were assigned to a group based on your parameters. During contig ordering, these contigs shifted towards the ends of the group. If you believe that there are an insufficient number of Hi-C links to accurately position these contigs, you could move them to "debris" in Juicebox. Best regards, Xiaofei
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
In the latest commit, I improved the stability of contig ordering and orientation by comparing the results of fast sorting with ALLHiC optimization and selecting the better one.
@ZJ-1992 is this the global heatmap? Would you please show the global (whole genome) heatmap? it is a segment of the chromosome 1, I set "--correct_nrounds 2", but No assembly errors found. There is the global heatmap. And I want to ask you how to convert results from hicpro to bam format?
Hello, I noticed in your HIC heatmap that some chromosome tails do not have HIC signals (indicated by the arrow). I also found this issue in one of my plant genomes, and the annotation found that this is a large tandem repeat sequence. I think this may be an assembly error of HIFIASM. How do you handle it in the future?
@zy66-su,
It's hard to tell whether these regions were correctly assembled or not using a Hi-C contact map. Hi-C reads that are mapped to such regions are often filtered out with the MAPQ > 0 criterion due to multiple mapping. You can verify these regions by using information from other data types or experiments, such as telomeric repeats, ultra-long read alignment, comparison with a reference genome, and the fluorescence in situ hybridization (FISH) experiment. If you are unsure whether these regions are correctly assembled, removing them to debris is also a acceptable choice.
You can also ask the author of hifiasm for more information: https://github.com/chhylp123/hifiasm/issues
Best wishes, Xiaofei
@zengxiaofei , Thanks for your reply. I happened to see the same problem in haphic and found that this situation is quite common. I think this problem can only be solved by FISH.
Sincerely, Yu Zhang
Hi,
I am trying to use HapHiC to anchor a haplotype-collapsed genome (hifiasm p_ctg), and I found that HapHiC perform not so good, especially contig ordering, HiC data was mapping following HiC-Pro 3 pipeline
here is HapHiC results heatmap
and here is EndHiC + manully curation results heatmap
If you need hic (or bam) and contig data for testing, I will upload to baidu netdisk and send you an e-mail (for tesing only).
Best, Kun