xfengnefx / hifiasm-meta

hifiasm_meta - de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio Hifi reads.
MIT License

Too much RAM memory required for metagenome assembly. #29

Closed machalita closed 1 year ago

machalita commented 1 year ago

Greetings! This is not a bug but rather an optimization issue. We are sequencing on a Revio with one sample per cell, which leaves us with a raw file of 30-40 GB per sample for metagenome assembly. Our server has 750 GB of RAM, and that does not seem to be enough. Is there a parameter that can be tweaked to reduce memory usage? Is there an approximate formula we can use to estimate how much memory is needed?

Thank you very much!

xfengnefx commented 1 year ago

Hi!

Which version are you using, and could you share the log (stderr)? Also, I guess the sample is either not from gut/fecal material, or a pooled sample from many similar individual samples...? In which case I have a local commit that fixes a certain high peak RSS problem. Maybe I should push it.

machalita commented 1 year ago

Hi! Thanks for your prompt reply! It's one stool sample per FASTQ. We are building a reference database from HiFi Revio data, which is why we use only one stool sample per cell: that way we get longer contigs for the reference. Your question actually made me wonder whether there is host DNA in the raw data, and whether that could be the reason. This is the version: hifiasm_meta 0.3-r073 (hifiasm code base 0.13-r308)

Here is the output of one of the runs:

[M::main] Start: Thu Oct 19 17:12:17 2023

[M::hamt_assemble] Skipped read selection.
[prof::yak_count] step 1 total 694.07 s, step2 362.13 s, step3 1011.65 s.
[M::ha_analyze_count] lowest: count[8] = 1424304
[M::ha_analyze_count] highest: count[30] = 97584106
[M::ha_hist_line] 2: ** 10127388
[M::ha_hist_line] 3: * 3363429
[M::ha_hist_line] 4: 2296285
[M::ha_hist_line] 5: 1850155
[M::ha_hist_line] 6: 1604104
[M::ha_hist_line] 7: 1458033
[M::ha_hist_line] 8: 1424304
[M::ha_hist_line] 9: 1525195
[M::ha_hist_line] 10: 1821511
[M::ha_hist_line] 11: 2355628
[M::ha_hist_line] 12: * 3033452
[M::ha_hist_line] 13: 4174765
[M::ha_hist_line] 14: ** 5678811
[M::ha_hist_line] 15: **** 7757332
[M::ha_hist_line] 16: * 10410188
[M::ha_hist_line] 17: ** 13935321
[M::ha_hist_line] 18: ***** 18413846
[M::ha_hist_line] 19: **** 23886233
[M::ha_hist_line] 20: * 30519807
[M::ha_hist_line] 21: ***** 37989331
[M::ha_hist_line] 22: **** 46354786
[M::ha_hist_line] 23: * 55401470
[M::ha_hist_line] 24: ** 64340262
[M::ha_hist_line] 25: *** 73237353
[M::ha_hist_line] 26: * 81300214
[M::ha_hist_line] 27: ** 88282840
[M::ha_hist_line] 28: **** 93602952
[M::ha_hist_line] 29: ***** 96804178
[M::ha_hist_line] 30: **** 97584106
[M::ha_hist_line] 31: ** 96094424
[M::ha_hist_line] 32: ***** 92682652
[M::ha_hist_line] 33: ** 87529679
[M::ha_hist_line] 34: * 80911527
[M::ha_hist_line] 35: ***** 73298126
[M::ha_hist_line] 36: * 65212438
[M::ha_hist_line] 37: ** 56773590
[M::ha_hist_line] 38: ** 48313475
[M::ha_hist_line] 39: *** 40291220
[M::ha_hist_line] 40: ** 33059098
[M::ha_hist_line] 41: * 26646298
[M::ha_hist_line] 42: **** 20966994
[M::ha_hist_line] 43: 16214006
[M::ha_hist_line] 44: *** 12480532
[M::ha_hist_line] 45: ** 9364133
[M::ha_hist_line] 46: * 7015649
[M::ha_hist_line] 47: * 5198922
[M::ha_hist_line] 48: ** 3869733
[M::ha_hist_line] 49: * 2835076
[M::ha_hist_line] 50: 2050567
[M::ha_hist_line] 51: 1553342
[M::ha_hist_line] 52: 1192912
[M::ha_hist_line] 53: 949363
[M::ha_hist_line] 54: 802113
[M::ha_hist_line] 55: 726049
[M::ha_hist_line] 56: 687219
[M::ha_hist_line] 57: 653101
[M::ha_hist_line] 58: 646124
[M::ha_hist_line] 59: 662057
[M::ha_hist_line] 60: 665511
[M::ha_hist_line] 61: 666643
[M::ha_hist_line] 62: 669793
[M::ha_hist_line] 63: 675624
[M::ha_hist_line] 64: 665216
[M::ha_hist_line] 65: 646753
[M::ha_hist_line] 66: 628077
[M::ha_hist_line] 67: 604381
[M::ha_hist_line] 68: 584115
[M::ha_hist_line] 69: 554289
[M::ha_hist_line] 70: 528338
[M::ha_hist_line] 71: 494115
[M::ha_hist_line] rest: ** 17581454
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::hamt_ft_gen] peak_hom: 30; peak_het: -1
[M::hamt_ft_gen::1029.413*24.57@62.390GB] ==> filtered out 974339 k-mers occurring 750 or more times
[M::hamt_assemble] Generated flt tab.

[M::hamt_assemble] entered read correction round 1
[M::ha_pt_gen] counting - minimzers
[prof::yak_count] step 1 total 791.70 s, step2 242.73 s, step3 512.32 s.
[M::ha_pt_gen::1831.994*17.50] ==> counted 90734509 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 1361 (for sanity check)
[M::ha_analyze_count] lowest: count[8] = 67172
[M::ha_analyze_count] highest: count[30] = 3787417
[M::ha_hist_line] 1: ****> 24456356
[M::ha_hist_line] 2: **** 608786
[M::ha_hist_line] 3: ** 178787
[M::ha_hist_line] 4: 116426
[M::ha_hist_line] 5: 92328
[M::ha_hist_line] 6: 79180
[M::ha_hist_line] 7: 70836
[M::ha_hist_line] 8: 67172
[M::ha_hist_line] 9: 69313
[M::ha_hist_line] 10: 80361
[M::ha_hist_line] 11: 100290
[M::ha_hist_line] 12: 126122
[M::ha_hist_line] 13: 168677
[M::ha_hist_line] 14: ** 227439
[M::ha_hist_line] 15: **** 307031
[M::ha_hist_line] 16: * 410916
[M::ha_hist_line] 17: ** 546977
[M::ha_hist_line] 18: ***** 721766
[M::ha_hist_line] 19: * 933200
[M::ha_hist_line] 20: *** 1187932
[M::ha_hist_line] 21: * 1474745
[M::ha_hist_line] 22: ***** 1798821
[M::ha_hist_line] 23: * 2151068
[M::ha_hist_line] 24: ** 2495899
[M::ha_hist_line] 25: *** 2836556
[M::ha_hist_line] 26: * 3149774
[M::ha_hist_line] 27: ** 3423664
[M::ha_hist_line] 28: **** 3630426
[M::ha_hist_line] 29: ***** 3754429
[M::ha_hist_line] 30: **** 3787417
[M::ha_hist_line] 31: **** 3733400
[M::ha_hist_line] 32: 3605152
[M::ha_hist_line] 33: ** 3403755
[M::ha_hist_line] 34: ***** 3145668
[M::ha_hist_line] 35: * 2850707
[M::ha_hist_line] 36: ***** 2542444
[M::ha_hist_line] 37: ** 2209581
[M::ha_hist_line] 38: ** 1884197
[M::ha_hist_line] 39: ** 1574312
[M::ha_hist_line] 40: ** 1293341
[M::ha_hist_line] 41: * 1039257
[M::ha_hist_line] 42: ** 820301
[M::ha_hist_line] 43: **** 634294
[M::ha_hist_line] 44: 488605
[M::ha_hist_line] 45: ** 367413
[M::ha_hist_line] 46: ** 275713
[M::ha_hist_line] 47: 203191
[M::ha_hist_line] 48: * 151040
[M::ha_hist_line] 49: 111831
[M::ha_hist_line] 50: 80421
[M::ha_hist_line] 51: 60981
[M::ha_hist_line] 52: 46922
[M::ha_hist_line] 53: 37343
[M::ha_hist_line] 54: 31106
[M::ha_hist_line] 55: 27904
[M::ha_hist_line] 56: 26260
[M::ha_hist_line] 57: 25190
[M::ha_hist_line] 58: 24665
[M::ha_hist_line] 59: 24742
[M::ha_hist_line] 60: 25325
[M::ha_hist_line] 61: 25320
[M::ha_hist_line] 62: 25679
[M::ha_hist_line] 63: 25817
[M::ha_hist_line] 64: 25455
[M::ha_hist_line] 65: 24870
[M::ha_hist_line] 66: 24162
[M::ha_hist_line] 67: 23185
[M::ha_hist_line] 68: 22432
[M::ha_hist_line] 69: 21430
[M::ha_hist_line] 70: 20177
[M::ha_hist_line] 71: 19286
[M::ha_hist_line] rest: ** 678941
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 30; peak_het: -1
[M::ha_pt_gen] counting - minimzer positions
[prof::yak_count] step 1 total 48.03 s, step2 356.28 s, step3 650.25 s.
[debug::ha_pt_gen] tot_cnt is 2139458679, pt->tot_pos is 2139458679
[M::ha_pt_gen::2488.172*15.72] ==> indexed 2139458679 positions
Killed

xfengnefx commented 1 year ago

Thank you for the log. The k-mer histogram is a bit strange: a bell-shaped peak at 30x is unusual for a metagenome sample. Could you check the input file to make sure it is not a eukaryotic library by accident?

I also pushed commit f98f1ad to the meta_dev branch; you could try it and see if it helps. Please post the log if the run is still killed or crashes, thanks.
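
For readers who want to run the same check on their own stderr log, here is a minimal Python sketch that pulls the first k-mer histogram out of the `[M::ha_hist_line]` records and reports its tallest bin. The function names and the `min_mult` cutoff are made up for illustration and are not part of hifiasm_meta; the parser only assumes the line layout visible in the log above.

```python
import re

# Matches records such as "[M::ha_hist_line] 30: **** 97584106".
# The "rest:" record is skipped because its label is not a number.
HIST_RE = re.compile(r"\[M::ha_hist_line\]\s+(\d+):\s*[*>]*\s*(\d+)")


def parse_first_histogram(log_text):
    """Return {multiplicity: count} for the first histogram in the log."""
    hist = {}
    last = 0
    for m in HIST_RE.finditer(log_text):
        mult, count = int(m.group(1)), int(m.group(2))
        if mult <= last:
            # Multiplicity went backwards: a later histogram has started.
            break
        hist[mult] = count
        last = mult
    return hist


def dominant_peak(hist, min_mult=5):
    """Return (multiplicity, count) of the tallest bin at or above min_mult.

    min_mult is a hypothetical cutoff that skips the low-multiplicity
    bins, which tend to be dominated by sequencing errors.
    """
    candidates = {k: v for k, v in hist.items() if k >= min_mult}
    mult = max(candidates, key=candidates.get)
    return mult, candidates[mult]


if __name__ == "__main__":
    import sys
    hist = parse_first_histogram(sys.stdin.read())
    if hist:
        print(dominant_peak(hist))
```

Run it as `python hist_peak.py < run.log` (both file names are placeholders). A single smooth peak like the one at 30x here suggests that one genome dominates the library, which is what prompted the question about a eukaryotic source.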

machalita commented 1 year ago

Thank you so much for your help! The samples I was trying to assemble did contain a high percentage of host reads; after filtering them out, I was able to assemble the data successfully. Apologies for my rookie mistake. This had never happened to me with stool samples, but I then found out these were "bloody" stools, so they contained plenty of host DNA =p Thank you!
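
The thread does not say how the host reads were removed. One common approach is to align the reads to the host reference and drop those that align well. The Python sketch below assumes the alignments are available in PAF format (the tab-separated format minimap2 writes by default) and that the FASTQ uses plain 4-line records. The function names and both thresholds are hypothetical and would need tuning for real data.

```python
def host_read_ids(paf_lines, min_query_cov=0.5, min_mapq=20):
    """Collect names of reads that align well to the host reference.

    PAF columns used (1-based): 1 query name, 2 query length,
    3 query start, 4 query end, 12 mapping quality.
    Both thresholds are illustrative assumptions.
    """
    host = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        mapq = int(fields[11])
        if qlen > 0 and mapq >= min_mapq and (qend - qstart) / qlen >= min_query_cov:
            host.add(qname)
    return host


def drop_host_reads(fastq_lines, host_ids):
    """Yield 4-line FASTQ records whose read name is not in host_ids.

    Assumes well-formed, unwrapped FASTQ (exactly four lines per read).
    """
    it = iter(fastq_lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it), next(it), next(it)
        name = header[1:].split()[0]
        if name not in host_ids:
            yield header, seq, plus, qual
```

Both functions take any iterable of lines, so they work on open file handles and stream through a large Revio FASTQ without loading it into memory. Only the set of host read names is held in RAM.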

xfengnefx commented 1 year ago

Glad it worked out :D and thank you for the testing. Closing, please feel free to reopen/post new if encountering any problem.