zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
141 stars 10 forks source link

Perform not so good for haplotype mixed assembly #2

Closed xiekunwhy closed 6 months ago

xiekunwhy commented 12 months ago

Hi,

I am trying to use HapHiC to anchor a haplotype-collapsed genome (hifiasm p_ctg), and I found that HapHiC perform not so good, especially contig ordering, HiC data was mapping following HiC-Pro 3 pipeline

here is HapHiC results heatmap image

and here is EndHiC + manully curation results heatmap image

If you need hic (or bam) and contig data for testing, I will upload to baidu netdisk and send you an e-mail (for tesing only).

Best, Kun

zengxiaofei commented 12 months ago

Hi Kun,

Thanks for your feedback! Could you please show me your command lines and paste the log files output by HapHiC? This information will be of great help.

Best regards, Xiaofei

xiekunwhy commented 12 months ago

Hi,

Here are command lines and logs HapHic.zip

Best, Kun

zengxiaofei commented 12 months ago

Hi, Kun

There's nothing wrong with your commands.

You could try using the parameter --skip_allhic in haphic sort. It will help me figure out whether the problem occurred during ALLHiC optimization. I have observed that in rare cases, optimization may introduce large-scale ordering and orientation errors.

Furthermore, could you please visualize your scaffolding result in Juicebox? I need to check whether there are misjoins in the contigs of the assembly.

Best regards, Xiaofei

xiekunwhy commented 12 months ago

Hi, Xiaofei

the perfomance is much better than default when adding --skip_allhic parameter, here is haphic sort --skip_allhic results heatmap (group5 is butterfly miss-connected?), image

the errors between group1 and group4 is caused by contig errors (ptg000002l and ptg000030l), here is contig heatmap, preparing Juicebox is time consuming, so I just plot contig heatmap here (>1Mb contig was kept). zja contig1mb

Best, Kun

xiekunwhy commented 12 months ago

Here is the agp file (haphic sort --skip_allhic) scaffolds.agp.zip

zengxiaofei commented 12 months ago

Hi Kun,

I'm glad to see that your result has improved significantly. I also have a plan to fix the issue raised by ALLHiC.

In group5, there could be a misconnection between telomeres. This problem may also be improved in subsequent updates.

To correct the misjoins in contigs, you can try using the parameter --correct_nrounds 2 in haphic cluster, or manually curating the result in Juicebox. The assembly correction algorithm implemented in HapHiC is more stringent than other scaffolding tools, so it should not significantly affect the contig contiguity of your assembly.

Best regards, Xiaofei

xiekunwhy commented 12 months ago

Hi, Xiaofei

Glad to hear that and hope new version comming soon.

Thank you for your suggestions, I will try it.

Best, Kun

xiekunwhy commented 12 months ago

Hi, Xiaofei

Just a suggestion.

Would you please don't change contig names in final agp file when running pipeline with --correct_nrounds (>0) options, use original coordinates is more convenience to other to lift things (for example, gene annotation gff) from contig coordinates to the final scaffold coordinates.

HapHiC use *_breakn as new contig name, and the coordinates of all break pieces are re-counting from 1, users need to write a script to covert names and coordinates back, some one can not write such script well, I think.

YaHS keep original contig names and coordinates in final results, if I remember correctly.

Best, Kun

zengxiaofei commented 12 months ago

Hi Kun,

Thanks for your suggestion!

I will improve it in the subsequent updates.

Best regards, Xiaofei

zengxiaofei commented 11 months ago

Hi, Xiaofei

Just a suggestion.

Would you please don't change contig names in final agp file when running pipeline with --correct_nrounds (>0) options, use original coordinates is more convenience to other to lift things (for example, gene annotation gff) from contig coordinates to the final scaffold coordinates.

HapHiC use *_breakn as new contig name, and the coordinates of all break pieces are re-counting from 1, users need to write a script to covert names and coordinates back, some one can not write such script well, I think.

YaHS keep original contig names and coordinates in final results, if I remember correctly.

Best, Kun

Hi Kun,

I have addressed this issue in the latest version of HapHiC (version 1.0.1). For more details, please read the README.md.

I will start to handle the problem regarding the contig ordering and orientation. But it still needs some time for me.

Best regards, Xiaofei

xiekunwhy commented 11 months ago

Hi Xiaofei,

Thank you very much, I will try it.

Best, Kun

zengxiaofei commented 11 months ago

In the latest commit of HapHiC, I have fixed a bug which can lead to a Segmentation fault in running juicebox.sh.

Xiaofei

xiekunwhy commented 11 months ago

Hi,

sklearn in https://github.com/zengxiaofei/HapHiC/blob/main/conda_env/create_conda_env_py310.sh means scikit-learn, right? I got error log when I want to re-install a newer version. The errors like following

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple/ sklearn
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple/
Collecting sklearn
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/46/1c/395a83ee7b2d2ad7a05b453872053d41449564477c81dc356f720b16eac4/sklearn-0.0.post12.tar.gz (2.6 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.

      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error

      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-package
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Best, Kun

zengxiaofei commented 11 months ago

Yes, to install a newer version of sklearn, you should pip3 install scikit-learn, just like what we did in https://github.com/zengxiaofei/HapHiC/blob/main/conda_env/create_conda_env_py311.sh:

conda create -n haphic_py311 python=3.11.2
conda activate haphic_py311

pip3 install numpy scipy matplotlib
pip3 install scikit-learn networkx pysam

conda install -c intel mkl
conda config --add channels conda-forge
conda install sparse_dot_mkl

pip3 install portion

conda env export > environment_py311.yml

Best regards, Xiaofei

xiekunwhy commented 11 months ago

Hi,

Just a bug, haphic pipeline with --correct_nrounds > 0, if there is no contig need to be corrected, corrected_ctgs.txt will not be written, but 04.build step still reading corrected_ctgs.txt file.

2023-12-15 18:53:51 <HapHiC_sort.py> [fast_sort] [group7_48742957bp] Checking the content of input group file...
2023-12-15 18:53:51 <HapHiC_sort.py> [fast_sort] [group7_48742957bp] Starting fast sorting iterations...
2023-12-15 18:53:51 <HapHiC_sort.py> [run] Program finished in 8.638415098190308s
2023-12-15 18:53:51 <HapHiC_pipeline.py> [haphic_build] Step4: Build final scaffolds (pseudomolecules)...
2023-12-15 18:53:51 <HapHiC_build.py> [run] Program started, HapHiC version: 1.0.2 (update: 2023.12.15)
2023-12-15 18:53:51 <HapHiC_build.py> [run] Python version: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
2023-12-15 18:53:51 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2023-12-15 18:54:03 <HapHiC_build.py> [parse_tours] Parsing tour files...
Traceback (most recent call last):
  File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py", line 483, in <module>
    main()
  File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py", line 476, in main
    haphic_build(args)
  File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py", line 437, in haphic_build
    HapHiC_build.run(args, log_file=LOG_FILE)
  File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_build.py", line 267, in run
    corrected_ctg_set = parse_corrected_ctgs(args.corrected_ctgs)
  File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_build.py", line 65, in parse_corrected_ctgs
    with open(corrected_ctgs) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/state/partition1/WORK/Bio/Project/xiek/pipe/hic/GDDB22060177/haphic_ecc/01.cluster/corrected_ctgs.txt'
Traceback (most recent call last):
  File "/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/Bio/User/kxie/software/miniforge3/envs/haphic/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/Bio/User/kxie/pipeline/genome/anchor/apps/HapHiC/scripts/HapHiC_pipeline.py', 'Vaccinium_bracteatum.contigs.fa', 'Vaccinium_bracteatum.dedup.bam', '12', '--correct_nrounds', '2', '--RE', 'GATC', '--skip_allhic']' returned non-zero exit status 1.

Best, Kun

zengxiaofei commented 11 months ago

Hi Kun,

Thanks! I have fixed this bug in the latest commit.

Best regards, Xiaofei

ZJ-1992 commented 11 months ago

Hello xiao fei and kun, May I ask a question? I scaffolded my genome according to the method on GitHub,but it seems that there are some problems. Many areas with low HIC interaction signals are put together. What is the reason for this? 微信图片_20231228135358

xiekunwhy commented 11 months ago

@ZJ-1992 is this the global heatmap? Would you please show the global (whole genome) heatmap?

ZJ-1992 commented 11 months ago

@ZJ-1992 is this the global heatmap? Would you please show the global (whole genome) heatmap? it is a segment of the chromosome 1, I set "--correct_nrounds 2", but No assembly errors found. There is the global heatmap. And I want to ask you how to convert results from hicpro to bam format? global 1 2 dep

zengxiaofei commented 11 months ago

Hi @ZJ-1992,

The anchoring rate is controlled by several parameters including --min_RE_sites, --min_links, --min_link_density, and --min_density_ratio. Unfortunately, determining the thresholds of these parameters is still challenging until the heatmap is visualized. In your case, the contigs showing low Hi-C signals were assigned to a group based on your parameters. During contig ordering, these contigs shifted towards the ends of the group. If you believe that there are an insufficient number of Hi-C links to accurately position these contigs, you could move them to "debris" in Juicebox.

Best regards, Xiaofei

ZJ-1992 commented 10 months ago
        Hello,

xiaofei, I am sorry for my late reply. I can't determine if those very small fragments are correctly placed, so I can only put them at the end as a debris. Thank you for your suggestion!

                    ***@***.***

---- Replied Message ----

     From 

        Xiaofei ***@***.***>

     Date 

    12/29/2023 10:22

     To 

        ***@***.***>

     Cc 

        ***@***.***>
        ,

        ***@***.***>

     Subject 

          Re: [zengxiaofei/HapHiC] Perform not so good for haplotype mixed assembly (Issue #2)

Hi @ZJ-1992, The anchoring rate is controlled by several parameters including --min_RE_sites, --min_links, --min_link_density, --min_density_ratio. Unfortunately, determining the thresholds of these parameters is still challenging until the heatmap is visualized. In your case, the contigs showing low Hi-C signals were assigned to a group based on your parameters. During contig ordering, these contigs shifted towards the ends of the group. If you believe that there are an insufficient number of Hi-C links to accurately position these contigs, you could move them to "debris" in Juicebox. Best regards, Xiaofei

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

zengxiaofei commented 4 months ago

In the latest commit, I improved the stability of contig ordering and orientation by comparing the results of fast sorting with ALLHiC optimization and selecting the better one.

zy66-su commented 4 months ago

@ZJ-1992 is this the global heatmap? Would you please show the global (whole genome) heatmap? it is a segment of the chromosome 1, I set "--correct_nrounds 2", but No assembly errors found. There is the global heatmap. And I want to ask you how to convert results from hicpro to bam format? global 1 2 dep

Hello, I noticed in your HIC heatmap that some chromosome tails do not have HIC signals (indicated by the arrow). I also found this issue in one of my plant genomes, and the annotation found that this is a large tandem repeat sequence. I think this may be an assembly error of HIFIASM. How do you handle it in the future?
image

zengxiaofei commented 4 months ago

@zy66-su,

It's hard to tell whether these regions were correctly assembled or not using a Hi-C contact map. Hi-C reads that are mapped to such regions are often filtered out with the MAPQ > 0 criterion due to multiple mapping. You can verify these regions by using information from other data types or experiments, such as telomeric repeats, ultra-long read alignment, comparison with a reference genome, and the fluorescence in situ hybridization (FISH) experiment. If you are unsure whether these regions are correctly assembled, removing them to debris is also a acceptable choice.

You can also ask the author of hifiasm for more information: https://github.com/chhylp123/hifiasm/issues

Best wishes, Xiaofei

zy66-su commented 4 months ago

@zengxiaofei , Thanks for your reply. I happened to see the same problem in haphic and found that this situation is quite common. I think this problem can only be solved by FISH.

Sincerely, Yu Zhang