zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
141 stars 10 forks source link

Sequence lost after Juicebox curation #63

Closed lancelot-bio closed 2 months ago

lancelot-bio commented 2 months ago

Hi, After scaffolding, I followed the recommendation, executed the juicebox.sh, and used the out_JBAT.hic and out_JBAT.assembly to obtain a modified assembly in juicebox. However, when I ran the juicer post command, it posted an error: [E::make_asm_dict_from_agp] sequence ptg000032l not found I checked other files, 'ptg000032l' was presented in related files but the out_JBAT.review.assembly file. Is this error raised due to the changed name in out_JBAT.assembly?

Thanks

zengxiaofei commented 2 months ago

Please show me your commands.

lancelot-bio commented 2 months ago

Hi, for haphic: ~/01_Tools/HapHiC/haphic pipeline filtered.curated.fasta HiC.filtered.bam 10 --threads 88 --skip_allhic --correct_nrounds 2

for juicer.sh: ln -s ~/01_PrimaryContig/test_Haphic/filtered.curated.fasta . samtools faidx filtered.curated.fasta ~/01_Tools/HapHiC/scripts/../utils/juicer pre -a -q 1 -o out_JBAT ~/01_PrimaryContig/test_Haphic/HiC.filtered.bam scaffolds.raw.agp filtered.curated.fasta.fai >out_JBAT.log 2>&1 (java -jar -Xmx32G ~/01_Tools/HapHiC/scripts/../utils/juicer_tools.1.9.9_jcuda.0.8.jar pre out_JBAT.txt out_JBAT.hic.part <(cat out_JBAT.log | grep PRE_C_SIZE | awk '{print $2" "$3}')) && (mv out_JBAT.hic.part out_JBAT.hic)

for juicer post: ~/01_Tools/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp scaffolds.fa

Thanks.

zengxiaofei commented 2 months ago

The command for juicer post should be:

~/01_Tools/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp filtered.curated.fasta
lancelot-bio commented 2 months ago

Hi, Ye, I found the problem, and it posted another error: [E::assembly_to_agp] coordinates conversion error ptg000032l 0 Here is my agp and assembly file: out_JBAT.zip Thanks.

zengxiaofei commented 2 months ago

The contig IDs in the assembly file should start with ctg (contig names converted by juicer pre). However, the first 18 contig IDs start with ptg (original contig names). Did you modify the assembly file yourself?

lancelot-bio commented 2 months ago

Hi, Yes, when I realized the problem may related to the name, I tried to modify them, and, eventually, I forgot to reverse these actions.

Many thanks.

Isoris commented 1 month ago

I woud like to ask you too because I have generated the files with the --quick_mode and so It seems I encountered the same error as you.

Do you think that I could yahs/agp_to_fasta.py into juicer post -a ?

Thank you for the answer. Quentin

Isoris commented 1 month ago

It seems to be related to how we load the assemblies in JBAT because my scaffolds went from ctg to group1 because I have used a custom script to get the green contig bins maybe the other people did too or did you solve it ?

https://github.com/ShunOuchi/GreenHill/blob/main/utils/fasta_to_juicebox_assembly.py

Isoris commented 1 month ago

I solved it using the script in the juicebox_scripts/ folder. Thanks again

bl24 commented 2 weeks ago

@Isoris Hi, I'm running into this same issue. What script did you end up using?

Isoris commented 2 weeks ago

So what your have to do is to go into the yahs directory and there will be a ditectory juicebox scripts and inside another juicebox scripts

Then in this directory there is like generate fasta from assembly or something like that. You need to first generate the agp file of your sequence. then the contig names should match the assembly file contig names and then you can generate the fasta... Let me search it again for you. I don't remember precisely...

On Thu, Nov 7, 2024, 3:58 AM bl24 @.***> wrote:

@Isoris https://github.com/Isoris Hi, I'm running into this same issue. What script did you end up using?

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2460764728, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TG4OZZJ7BLDZZQ6RF3Z7J7F3AVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRQG43DINZSHA . You are receiving this because you were mentioned.Message ID: @.***>

Isoris commented 2 weeks ago

But that worked because I renamed the contigs of the assembly file of juicebox manually with sed prior to using the convert to fasta script. And I could guess the names of the scaffold from their order because I didn't move them (reorder them like 1 2 3 4) but I kept the order of the Blue scaffold intact as in the initial ( non reviewed) assembly file. This way I could rename the reviewed file with the names in ghe AGP file. And use both to make the fasta.

bl24 commented 2 weeks ago

Are you referring to juicebox_assembly_converter.py? If found that from the juicebox_scripts repository: https://github.com/phasegenomics/juicebox_scripts

Isoris commented 2 weeks ago

Yes !! This one. Does it work?

On Fri, Nov 8, 2024, 2:12 AM bl24 @.***> wrote:

Are you referring to juicebox_assembly_converter.py? If found that from the juicebox_scripts repository: https://github.com/phasegenomics/juicebox_scripts

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2463020797, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TBKUB5HGKWYSKP7WBDZ7O3RXAVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRTGAZDANZZG4 . You are receiving this because you were mentioned.Message ID: @.***>

Isoris commented 2 weeks ago

You need to use the .review file from juicebox as assembly for the script.

On Fri, Nov 8, 2024, 2:23 AM Quentin Andrès @.***> wrote:

Yes !! This one. Does it work?

On Fri, Nov 8, 2024, 2:12 AM bl24 @.***> wrote:

Are you referring to juicebox_assembly_converter.py? If found that from the juicebox_scripts repository: https://github.com/phasegenomics/juicebox_scripts

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2463020797, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TBKUB5HGKWYSKP7WBDZ7O3RXAVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRTGAZDANZZG4 . You are receiving this because you were mentioned.Message ID: @.***>

bl24 commented 2 weeks ago

I did try it but I get this error:

Checking for breaks listed in .assembly and making them... Traceback (most recent call last): File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 872, in processed_assembly = JuiceboxConverter().process(fasta, assembly, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 91, in process sequences = self._add_breaks(sequences, assembly_map) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 300, in _add_breaks raise BadContigNameError("Unbroken contig {} failed to map!!".format(fragment_name)) BadContigNameError: Unbroken contig ctg00000001.1 failed to map!!

Isoris commented 2 weeks ago

Do you have any debris in your assembly? Or fragments? Something like :::debris

It is because this script cannot accommodate with small debris you need maybe to copy your assembly file and to delete lines with debris

On Fri, Nov 8, 2024, 3:14 AM bl24 @.***> wrote:

I did try it but I get this error:

Checking for breaks listed in .assembly and making them... Traceback (most recent call last): File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 872, in processed_assembly = JuiceboxConverter().process(fasta, assembly, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 91, in process sequences = self._add_breaks(sequences, assembly_map) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 300, in _add_breaks raise BadContigNameError("Unbroken contig {} failed to map!!".format(fragment_name)) BadContigNameError: Unbroken contig ctg00000001.1 failed to map!!

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2463126648, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TEJ7V7OXCUDNEFZ6YDZ7PC2NAVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRTGEZDMNRUHA . You are receiving this because you were mentioned.Message ID: @.***>

Isoris commented 2 weeks ago

The names of your contigs in your fasta should be the same as in Your assembly file

On Fri, Nov 8, 2024, 3:14 AM bl24 @.***> wrote:

I did try it but I get this error:

Checking for breaks listed in .assembly and making them... Traceback (most recent call last): File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 872, in processed_assembly = JuiceboxConverter().process(fasta, assembly, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 91, in process sequences = self._add_breaks(sequences, assembly_map) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 300, in _add_breaks raise BadContigNameError("Unbroken contig {} failed to map!!".format(fragment_name)) BadContigNameError: Unbroken contig ctg00000001.1 failed to map!!

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2463126648, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TEJ7V7OXCUDNEFZ6YDZ7PC2NAVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRTGEZDMNRUHA . You are receiving this because you were mentioned.Message ID: @.***>

Isoris commented 2 weeks ago

Compare the initial assembly file ( before juicebox) to the final review file. Are the names matching ? Next are the review file contig names the same as the review assembly file ?

Next, remove lines in the final review file that have debris on it. And lines that end with 0 change to 1 to indicate its 1 nucleotide long for the contig and not 0 ( sometimes Juicebox can bug and when you cut at a certain zoom and have the camera locked it will cut the contig and generate a small contig that has a length of 0... )

On Fri, Nov 8, 2024, 3:18 AM Quentin Andrès @.***> wrote:

The names of your contigs in your fasta should be the same as in Your assembly file

On Fri, Nov 8, 2024, 3:14 AM bl24 @.***> wrote:

I did try it but I get this error:

Checking for breaks listed in .assembly and making them... Traceback (most recent call last): File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 872, in processed_assembly = JuiceboxConverter().process(fasta, assembly, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 91, in process sequences = self._add_breaks(sequences, assembly_map) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 300, in _add_breaks raise BadContigNameError("Unbroken contig {} failed to map!!".format(fragment_name)) BadContigNameError: Unbroken contig ctg00000001.1 failed to map!!

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2463126648, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TEJ7V7OXCUDNEFZ6YDZ7PC2NAVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRTGEZDMNRUHA . You are receiving this because you were mentioned.Message ID: @.***>

bl24 commented 2 weeks ago

Okay, I removed the debris but then it gave me a different error: MissingFragmentError: Assembly out_JBAT.review2.assembly is missing the sequence for index 19

I'm confused about the contigs in my fasta needing to be the same as the assembly. Isn't the purpose of the liftover.agp file to have a conversion between the assembly and the fasta? I ask because I've able to make juicer post work on other assemblies that didn't use quick_mode.

Isoris commented 2 weeks ago

Put nack the debris lines and the lines that you removed and make sure that the contig names in the scaffold.assembly ( before juicebox ) are the same as the .review.assembly

And the same as the fasta

Yes there's a bug that breaks the review file when we add fragments let me check my bash History

On Fri, Nov 8, 2024, 3:36 AM bl24 @.***> wrote:

Okay, I removed the debris but then it gave me a different error: MissingFragmentError: Assembly out_JBAT.review2.assembly is missing the sequence for index 19

I'm confused about the contigs in my fasta needing to be the same as the assembly. Isn't the purpose of the liftover.agp file to have a conversion between the assembly and the fasta? I ask because I've able to make juicer post work on other assemblies that didn't use quick_mode.

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2463164262, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TF4F3SPTPFHDEZAK73Z7PFNVAVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRTGE3DIMRWGI . You are receiving this because you were mentioned.Message ID: @.***>

bl24 commented 2 weeks ago

Comparing my original assembly with my review assembly and I noticed there are also things labeled "fragments" in my review file

ctg00000018.1:::fragment_1 18 2575032 ctg00000018.1:::fragment_2:::debris 19 775000 ctg00000018.1:::fragment_3 20 424305 ctg00000019.1:::fragment_1 21 237691 ctg00000019.1:::fragment_2:::debris 22 375000 ctg00000019.1:::fragment_3 23 7350695

Isoris commented 2 weeks ago

I don't understand why you have debris ? In my opinion you can maybe try to remove chromosome boundaries and then move to debris?

Like in JBAT the debris are the unwanted contigs. Usually you remove the chromosome boundaries ?

The convert script cannot accomodate with the :::debris I think.

Thats why I cannot really cut and rebuild thr fasta it's like a bug?

On Fri, Nov 8, 2024, 3:44 AM bl24 @.***> wrote:

Comparing my original assembly with my review assembly and I noticed there are also things labeled "fragments" in my review file

ctg00000018.1:::fragment_1 18 2575032 ctg00000018.1:::fragment_2:::debris 19 775000 ctg00000018.1:::fragment_3 20 424305 ctg00000019.1:::fragment_1 21 237691 ctg00000019.1:::fragment_2:::debris 22 375000 ctg00000019.1:::fragment_3 23 7350695

— Reply to this email directly, view it on GitHub https://github.com/zengxiaofei/HapHiC/issues/63#issuecomment-2463175874, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TAUZCMMWDXJX6MHW53Z7PGKDAVCNFSM6AAAAABN54YSU2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRTGE3TKOBXGQ . You are receiving this because you were mentioned.Message ID: @.***>

Isoris commented 2 weeks ago

But that worked because I renamed the contigs of the assembly file of juicebox manually with sed prior to using the convert to fasta script. And I could guess the names of the scaffold from their order because I didn't move them (reorder them like 1 2 3 4) but I kept the order of the Blue scaffold intact as in the initial ( non reviewed) assembly file. This way I could rename the reviewed file with the names in ghe AGP file. And use both to make the fasta.

Isoris commented 2 weeks ago

Here is what my .review file looks like

IMG_20241108_035515.jpg

IMG_20241108_035529.jpg

Isoris commented 2 weeks ago

Comparing my original assembly with my review assembly and I noticed there are also things labeled "fragments" in my review file

ctg00000018.1:::fragment_1 18 2575032 ctg00000018.1:::fragment_2:::debris 19 775000 ctg00000018.1:::fragment_3 20 424305 ctg00000019.1:::fragment_1 21 237691 ctg00000019.1:::fragment_2:::debris 22 375000 ctg00000019.1:::fragment_3 23 7350695

The fragments in the review file are normal they are the green contigs squares in Juicebox. While the contigs ( on the left ) are the blue scaffolds in juicebox.

Your file seems normal. And what about your fasta ?

Can you grep the header names?

Isoris commented 2 weeks ago

It's probably a matter of names or maybe it's related to the .1 ? After ctg ? The fasta file from which the .assembly file originates needs to be the one used in the script. Or maybe becauae your first contig has a problem or no fragment or ? Can you please show the top of your .review file?

Also maybe you need to change the file to

>ctg instead of ctg

I did try it but I get this error:

Checking for breaks listed in .assembly and making them... Traceback (most recent call last): File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 872, in processed_assembly = JuiceboxConverter().process(fasta, assembly, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 91, in process sequences = self._add_breaks(sequences, assembly_map) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/wing2/users/bmlau/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py", line 300, in _add_breaks raise BadContigNameError("Unbroken contig {} failed to map!!".format(fragment_name)) BadContigNameError: Unbroken contig ctg00000001.1 failed to map!!

bl24 commented 2 weeks ago

Okay, I realize the reason why I was getting errors is because I was using a different fasta file. Instead of using the original fasta input for the HapHic pipeline I used the corrected_asm.fa for quick_view mode after the pipeline gave me the message:

"It seems that some chromosomes were grouped together (length ratio = 0.5) You could check whether the parameters used are correct / appropriate and then try to tune the parameters for assembly correction, contig / Hi-C link filtration, or Markov clustering"