Closed: alienzj closed this issue 3 years ago.
I see the same behavior with a "real" (i.e. non-simulated) metagenomics PacBio library. All the "*.bin" files seem to be correctly written, but all the other files are empty. I also see the same absence of overlaps:
[M::hamt_assemble::4567.208*11.17@17.077GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
But other assemblers are able to generate contigs from the same library, so I would expect some overlaps to be found?
@alienzj Sorry for the late reply. It's weird that you didn't get "ha_hist_line" lines (I just did a fresh clone + test run, which was sane). This step should be identical to the stable hifiasm (since no read selection was involved), and local test runs were all sane, so I'm not sure where it went wrong. Could you try assembling with the stable hifiasm and see if the overlap stats look right?
@simroux Sorry for the late reply. What's the full STDERR output? Are there "ha_hist_line" lines? If there are none, could you try assembling with the stable hifiasm? If hifiasm prints out ha_hist_line lines and reports overlaps, then it's my bug. I think the issue you encountered is similar to @alienzj's, but I'm not sure why yet, since local test runs have been alright...
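(If the stderr was saved to a file, a rough way to check, assuming a hypothetical log name asm.log, is:
$ grep -c 'ha_hist_line' asm.log
which prints the number of matching lines; 0 means the k-mer histogram was never reported.)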
@xfengnefx : I do see the "ha_hist_line" lines, but I'm not sure if they reflect overlaps or not. I attached the full log; hopefully it will help clarify what is happening! asm.log
@simroux Thanks for the log; then it's probably a different problem from the one in the main post of this thread. The read length distribution looks unusual. Is this a PacBio HiFi dataset, or HiFi mixed with other libraries? I think we usually expect HiFi reads to fall within the 5kb~20kb range; 1kb seems very short.
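(If it helps, a rough sketch for eyeballing the read length distribution straight from the fastq, assuming an uncompressed file named reads.fq with plain 4-line records; pipe through zcat first if it's gzipped:
$ awk 'NR % 4 == 2 { print length($0) }' reads.fq | sort -n | awk '{ a[NR] = $1 } END { print "n:", NR, "min:", a[1], "median:", a[int((NR+1)/2)], "max:", a[NR] }'
)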
@xfengnefx Pretty sure these are HiFi reads (will get confirmation tomorrow). We typically try to aim for the ~5-20kb range, but real environmental samples being what they are, we often end up with a fair number of (very) short reads.
@simroux Thank you, I appreciate it!
Anyway, it's weird to see no overlaps at all. What's the file size of *ovecinfo.bin? This file logs all overlaps ever calculated, regardless of whether they were thrown away later on. We've seen hifiasm_meta fail on two real private samples, but that produced fragmented contigs, not zero overlaps. And another assembler also failed on those samples.
I'm not sure why, let me discuss this with others tomorrow and see what they think.
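(Assuming an output prefix of asm, something like the following would list the bin files:
$ ls -lh asm*.bin
)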
@xfengnefx bin files seem relatively small to me:
1.3G Dec 29 19:55 asm.ec.bin
41M Dec 29 19:55 asm.ec.mt.bin
21M Dec 29 19:55 asm.ovecinfo.bin
11M Dec 29 19:55 asm.ovlp.reverse.bin
11M Dec 29 19:55 asm.ovlp.source.bin
We're not in a rush, but curious to know if this is an issue with our input file.
Hi, I tested CAMI data (not HiFi) with the stable hifiasm (not hifiasm_meta).
$ hifiasm -o test_hifiasm/cami_0.hifiasm.out -t 32 /zfssz3/pub/database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_ft_gen::483.745*3.69@16.369GB] ==> filtered out 1384687 k-mers occurring -5 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::662.355*4.53] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::713.792*5.53] ==> indexed 0 positions
[M::ha_assemble::742.572*6.55@16.369GB] ==> corrected reads for round 1
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::797.425*7.35] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::849.905*8.01] ==> indexed 0 positions
[M::ha_assemble::880.619*8.84@16.369GB] ==> corrected reads for round 2
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::934.191*9.41] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::984.778*9.89] ==> indexed 0 positions
[M::ha_assemble::1012.330*10.48@16.369GB] ==> corrected reads for round 3
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::1066.756*10.89] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::1117.322*11.24] ==> indexed 0 positions
[M::ha_assemble::1143.542*11.71@16.369GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: -1
[M::purge_dups] purge duplication coverage threshold: -1
[M::adjust_utg_by_primary] primary contig coverage range: [0, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.13-r307
[M::main] CMD: hifiasm -o test_hifiasm/cami_0.hifiasm.out -t 32 /zfssz3/pub/database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
[M::main] Real time: 1154.208 sec; CPU: 13399.291 sec; Peak RSS: 16.369 GB
There are no "ha_hist_line" lines that I can see. Maybe non-HiFi reads do not meet hifiasm's requirements.
Thanks.
@alienzj Thanks for the confirmation. Looks like it's hifiasm's behavior.
@xfengnefx : Quick update: after double-checking, it turned out that my previous input file was a mix of CCS and non-CCS reads. I have now re-run hifiasm-meta on the same data with only the CCS reads, and this gave me a nice assembly (no complete circular chromosomes, sadly, but definitely on par with or better than our previous attempt with the same reads). So this was a user problem, i.e. I should have read more carefully that the input really must be CCS reads, not just any PacBio reads :-) Thanks for the quick replies and the help with troubleshooting my mistake!
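(Side note in case it helps others: assuming the read names carry the standard PacBio /ccs suffix, a sketch of how the CCS reads could be extracted, e.g. with seqkit, is:
$ seqkit grep -n -r -p '/ccs$' mixed_reads.fq.gz -o ccs_only.fq.gz
where mixed_reads.fq.gz and ccs_only.fq.gz are placeholder names, not necessarily how I did it.)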
@simroux Great, glad to know it's not an unknown bug on my side; thank you for the confirmation! My guess is that if non-CCS reads are longer than CCS ones, hifiasm's containment removal step might happen to favor the longer non-CCS reads (and overlaps between them are then discarded because of quality). The meta fork has heuristics for HiFi low-coverage + het corner cases, but probably can't help with non-CCS reads. This is just my guess, though...
Regarding assembly performance, do you expect low coverage in this dataset? Is it mostly plasmids or very short circular genomes? We saw one case of low coverage and one case of shared sequence causing trouble in the mock datasets, and I'm currently working on a fix for plasmids. The heuristic needs more test data to improve, however. We would definitely appreciate it if you are happy to share data for development. Thanks!
@xfengnefx I think it's mostly low coverage and some strain variation (Bandage shows a large blob of ~50Mb which looks like bacterial genomes, but coverage is "only" ~15x, so not super great). The data is currently unpublished, so I don't think we can share it right now, but I'll follow up as soon as that becomes possible; we are also moving forward with further tests given the promising results on this first dataset.
@simroux I've seen the blob thing in one private dataset we have access to (the readme only mentioned the sheep dataset because the private one was shared for internal dev only). It's probably strains + horizontal gene transfer. For a few datasets we don't have direct access to, some did much better than others, and the library preparation approach also seems to affect the outcome. There's a work-in-progress fix; I'll push it to the repo if it works out.
Closing this issue for now, but please feel free to follow up or drop a new one anytime. Thank you!
Hi, I used hifiasm-meta to assemble urogenital tract metagenomics data from CAMI.
The data was simulated by CAMISIM; average read length: 3,000 bp, read length s.d.: 1,000 bp.
Run log:
Output:
All the GFA files have size zero.
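(For example, assuming the outputs are in the current directory, the empty GFAs can be listed with:
$ find . -name '*.gfa' -size 0
)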
Any help? Thanks ~