xfengnefx / hifiasm-meta

hifiasm_meta - de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio Hifi reads.
MIT License

GFA file size issue #2

Closed alienzj closed 3 years ago

alienzj commented 3 years ago

Hi, I used hifiasm-meta to assemble urogenital tract metagenome data from CAMI.

The data were simulated with CAMISIM; average read length: 3,000 bp, read length s.d.: 1,000 bp.

Run log:

$ hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz

[M::hamt_assemble] Skipped read selection.
[M::ha_analyze_count] lowest: count[16383] = 0
[M::hamt_ft_gen::278.101*4.89@16.432GB] ==> filtered out 0 k-mers occurring 750 or more times
[M::hamt_assemble] Generated flt tab.
alloc 1666925 uint16_t
[M::ha_pt_gen::398.464*4.70] ==> counted 131777689 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::431.595*5.13] ==> indexed 59765 positions
[M::hamt_assemble::439.470*5.61@16.432GB] ==> corrected reads for round 1
[M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
[M::hamt_assemble] size of buffer: 0.132GB
[M::ha_pt_gen::470.852*6.04] ==> counted 131777979 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::506.590*6.28] ==> indexed 59765 positions
[M::hamt_assemble::514.866*6.69@16.432GB] ==> corrected reads for round 2
[M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
[M::hamt_assemble] size of buffer: 0.132GB
[M::ha_pt_gen::559.852*6.81] ==> counted 131777979 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::597.090*6.98] ==> indexed 59765 positions
[M::hamt_assemble::606.630*7.37@16.432GB] ==> corrected reads for round 3
[M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
[M::hamt_assemble] size of buffer: 0.132GB
[M::ha_pt_gen::643.258*7.55] ==> counted 131777979 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::674.827*7.68] ==> indexed 59765 positions
[M::hamt_assemble::683.525*7.98@16.432GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
[M::hist_readlength] <1.0k:
[M::hist_readlength] 1.0k: ]]]]]]]]
[M::hist_readlength] 1.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 2.0k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 2.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 3.0k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 3.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]                                                                                                                                                                                    
[M::hist_readlength] 4.0k: ]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 4.5k: ]]]]]]]]]]]]]
[M::hist_readlength] 5.0k: ]]]]]]]
[M::hist_readlength] 5.5k: ]]]]
[M::hist_readlength] 6.0k: ]]
[M::hist_readlength] 6.5k: ]
[M::hist_readlength] 7.0k: ]
[M::hist_readlength] 7.5k: ]
[M::hist_readlength] 8.0k: ]
[M::hist_readlength] 8.5k: ]
[M::hist_readlength] 9.0k: ]
[M::hist_readlength] 9.5k: ]
[M::hist_readlength] 10.0k: ]
[M::hist_readlength] 10.5k: ]
[M::hist_readlength] 11.0k: ]
[M::hist_readlength] 11.5k: ]
[M::hist_readlength] >50.0k: 0
Writing reads to disk...
wrote cmd of length 323: version=0.13-r308, CMD= hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
Bin file was created on Wed Dec 30 15:31:02 2020
Hifiasm_meta 0.1-r022 (hifiasm code base 0.13-r308).
Reads has been written.
[hamt::write_All_reads] Writing per-read coverage info...
[hamt::write_All_reads] Finished writing.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
[M::hamt_output_unitig_graph_advance] Writing GFA...
[M::hamt_output_unitig_graph_advance] Writing GFA...
[M::hamt_output_unitig_graph_advance] Writing GFA...
Inconsistency threshold for low-quality regions in BED files: 70%
Writing debug asg to disk...
[M::write_debug_assembly_graph] took 0.02s

[M::main] Hifiasm code base version: 0.13-r308
[M::main] Hifiasm_meta version: 0.1-r022
[M::main] CMD: hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
[M::main] Real time: 691.048 sec; CPU: 5463.747 sec; Peak RSS: 16.432 GB

Output:

$ ll
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.a_ctg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.a_ctg.noseq.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.dbg_asg
.rw-r--r-- zhujie 2782  1.2 GB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ec.bin
.rw-r--r-- zhujie 2782 38.2 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ec.mt.bin
.rw-r--r-- zhujie 2782  6.7 MB Wed Dec 30 15:31:00 2020 cami_0.hifiasm_meta.out.ovecinfo.bin
.rw-r--r-- zhujie 2782  9.5 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ovlp.reverse.bin
.rw-r--r-- zhujie 2782  9.5 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ovlp.source.bin
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_ctg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_ctg.noseq.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_utg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_utg.noseq.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.lowQ.bed
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.noseq.gfa

All of the GFA files are zero bytes.

Any help? Thanks ~

simroux commented 3 years ago

I see the same behavior with a "real" (i.e. non-simulated) metagenomic PacBio library. All the "*.bin" files seem to be written correctly, but all the other files are empty. I also see the same absence of overlaps:

[M::hamt_assemble::4567.208*11.17@17.077GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0

But other assemblers are able to generate contigs from the same library, so I would expect some overlaps to be found?

xfengnefx commented 3 years ago

@alienzj Sorry for the late reply. It's odd that you didn't get any "ha_hist_line" lines (I just did a fresh clone and a test run, which looked sane). This step should be identical to stable hifiasm (since no read selection was involved), and local test runs have all been sane, so I'm not sure where it went wrong. Could you try assembling with stable hifiasm and see if the overlap stats look right?

xfengnefx commented 3 years ago

@simroux Sorry for the late reply. What's the full STDERR output? Are there any "ha_hist_line" lines? If there are none, could you try assembling with stable hifiasm? If hifiasm prints ha_hist_line lines and reports overlaps, then it's my bug. I think the issue you encountered is similar to @alienzj's, but I'm not sure why yet, since local test runs have been alright...

simroux commented 3 years ago

@xfengnefx: I do see the "ha_hist_line" lines, but I'm not sure whether they reflect overlaps or not. I attached the full log; hopefully it will help clarify what happens! asm.log

xfengnefx commented 3 years ago

@simroux Thanks for the log; then it's probably a different problem than the one in the main post of this thread. The read length distribution looks unusual. Is this a PacBio HiFi dataset, or HiFi mixed with other libraries? We usually expect HiFi reads to fall within the 5 kb~20 kb range; 1 kb seems very short.
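
If it helps, a quick way to check the length distribution outside of hifiasm might look like the minimal sketch below (only an illustration; the path is a placeholder, and it assumes a gzipped 4-line FASTQ):

```python
# Minimal sketch (not part of hifiasm): bin read lengths into 1 kb buckets
# and report how much of the library falls in the ~5-20 kb HiFi range.
# "reads.fq.gz" is a placeholder path.
import gzip
from collections import Counter

PATH = "reads.fq.gz"

bins = Counter()
total_reads = 0
total_bases = 0
with gzip.open(PATH, "rt") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:  # sequence line of each 4-line record
            n = len(line.strip())
            bins[n // 1000] += 1
            total_reads += 1
            total_bases += n

in_range = sum(c for kb, c in bins.items() if 5 <= kb < 20)
print(f"reads: {total_reads}, mean length: {total_bases / total_reads:.0f} bp")
print(f"reads in 5-20 kb: {in_range} ({100 * in_range / total_reads:.1f}%)")
for kb in sorted(bins):
    bar = "]" * max(1, 60 * bins[kb] // total_reads)
    print(f"{kb:3d} kb: {bar}")
```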

simroux commented 3 years ago

@xfengnefx Pretty sure these are HiFi reads (I will get confirmation tomorrow). We typically aim for the ~5-20 kb range, but real environmental samples being what they are, we often end up with a fair number of (very) short reads.

xfengnefx commented 3 years ago

@simroux Thank you, I appreciate it!

Anyway, it's odd to see no overlaps at all. What's the file size of *ovecinfo.bin? This file logs all overlaps ever calculated, regardless of whether they were thrown away later on. We've seen hifiasm_meta fail on two real private samples, but that was fragmented contigs, not a complete absence of overlaps. And another assembler also failed on those samples.

I'm not sure why, let me discuss this with others tomorrow and see what they think.

simroux commented 3 years ago

@xfengnefx The bin files seem relatively small to me:

1.3G Dec 29 19:55 asm.ec.bin
41M  Dec 29 19:55 asm.ec.mt.bin
21M  Dec 29 19:55 asm.ovecinfo.bin
11M  Dec 29 19:55 asm.ovlp.reverse.bin
11M  Dec 29 19:55 asm.ovlp.source.bin

We're not in a rush, but curious to know if this is an issue with our input file.

alienzj commented 3 years ago

> @alienzj Sorry for the late reply. It's odd that you didn't get any "ha_hist_line" lines (I just did a fresh clone and a test run, which looked sane). This step should be identical to stable hifiasm (since no read selection was involved), and local test runs have all been sane, so I'm not sure where it went wrong. Could you try assembling with stable hifiasm and see if the overlap stats look right?

Hi, I tested the CAMI data (not HiFi) with stable hifiasm (not hifiasm_meta).

$ hifiasm -o test_hifiasm/cami_0.hifiasm.out -t 32 /zfssz3/pub/database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz

[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_ft_gen::483.745*3.69@16.369GB] ==> filtered out 1384687 k-mers occurring -5 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::662.355*4.53] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::713.792*5.53] ==> indexed 0 positions
[M::ha_assemble::742.572*6.55@16.369GB] ==> corrected reads for round 1
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::797.425*7.35] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::849.905*8.01] ==> indexed 0 positions
[M::ha_assemble::880.619*8.84@16.369GB] ==> corrected reads for round 2
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::934.191*9.41] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::984.778*9.89] ==> indexed 0 positions
[M::ha_assemble::1012.330*10.48@16.369GB] ==> corrected reads for round 3
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::1066.756*10.89] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::1117.322*11.24] ==> indexed 0 positions
[M::ha_assemble::1143.542*11.71@16.369GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: -1
[M::purge_dups] purge duplication coverage threshold: -1
[M::adjust_utg_by_primary] primary contig coverage range: [0, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.13-r307
[M::main] CMD: hifiasm -o test_hifiasm/cami_0.hifiasm.out -t 32 /zfssz3/pub/database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
[M::main] Real time: 1154.208 sec; CPU: 13399.291 sec; Peak RSS: 16.369 GB

There are no "ha_hist_line" lines that I can see. Maybe non-HiFi reads do not meet hifiasm's requirements.

Thanks.

xfengnefx commented 3 years ago

@alienzj Thanks for the confirmation. Looks like it's hifiasm's behavior.

simroux commented 3 years ago

@xfengnefx: Quick update: after double checking, it turned out that my previous input file was a mix of ccs and non-ccs reads. I have now re-run hifiasm-meta on the same data with only the ccs reads, and this gave me a nice assembly (no complete circular chromosomes sadly, but definitely on par with or better than our previous attempt with the same reads). So this was a user problem, i.e. I should have read more carefully that the input really must be ccs reads, not just any PacBio reads :-) Thanks for the quick replies and the help with troubleshooting my mistake!
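
In case it helps anyone hitting the same mistake: since HiFi reads are ccs reads with a predicted accuracy of at least Q20, one rough way to screen a mixed FASTQ (when the original ccs BAM is not at hand) is to drop reads whose mean predicted accuracy falls below Q20. The sketch below is only an illustration with hypothetical file names, not an official PacBio workflow; filtering the ccs BAM on its read-quality (rq) tag is the better route.

```python
# Rough illustration only: keep reads whose mean predicted accuracy is >= Q20
# (the HiFi cutoff). File names are hypothetical; assumes Phred+33 qualities
# in a gzipped 4-line FASTQ.
import gzip

IN_PATH = "mixed_reads.fq.gz"   # hypothetical mixed ccs + non-ccs FASTQ
OUT_PATH = "ccs_like.fq.gz"     # hypothetical filtered output

def mean_error(qual: str) -> float:
    # Average the per-base error probabilities, not the Phred scores.
    probs = [10 ** (-(ord(c) - 33) / 10) for c in qual]
    return sum(probs) / len(probs)

kept = total = 0
with gzip.open(IN_PATH, "rt") as fin, gzip.open(OUT_PATH, "wt") as fout:
    while True:
        record = [fin.readline() for _ in range(4)]
        if not record[0]:  # EOF
            break
        total += 1
        if mean_error(record[3].rstrip("\n")) <= 0.01:  # Q20 == 1% error
            fout.writelines(record)
            kept += 1

print(f"kept {kept} of {total} reads at the Q20 cutoff")
```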

xfengnefx commented 3 years ago

@simroux Great, glad to know it's not an unknown bug on my side; thank you for the confirmation! I guess if the non-ccs reads are longer than the ccs ones, hifiasm's containment removal step might happen to favor the longer non-ccs reads (and the overlaps between them are then discarded because of quality). The meta fork has heuristics for HiFi low-coverage and het corner cases, but they probably can't help with non-ccs reads. It's just my guess...

Regarding the assembly performance, is the dataset expected to have low coverage? Is it mostly plasmids or very short circular genomes? We saw one case of low coverage and one case of shared sequence causing trouble in the mock datasets, and I'm currently working on a fix for plasmids. The heuristics need more test data to improve, however. We would definitely appreciate it if you are happy to share data for development. Thanks!

simroux commented 3 years ago

@xfengnefx I think it's mostly low coverage and some strain variation (Bandage shows a large blob of ~50 Mb which looks like bacterial genomes, but coverage is "only" ~15x, so not super great). The data are currently unpublished so I don't think we can share them right now, but I'll follow up as soon as that becomes possible, and we are also moving forward with further tests given the promising results on this first dataset.

xfengnefx commented 3 years ago

@simroux I've seen the blob thing in one private dataset we have access to (the readme only shows the sheep dataset because the private one was shared for internal development only). It's probably strains plus horizontal gene transfer. Among a few datasets we don't have direct access to, some did much better than others, and the library preparation approach also seems to affect the outcome. There's a work-in-progress fix; I'll push it to the repo if it works out.

Closing this issue for now, but please feel free to follow up or drop a new one anytime. Thank you!