wejlab / MetaScope

An R-based approach for preprocessing and aligning 16S, metagenomic, and metatranscriptomic data (PathoScope version 3.0)
GNU General Public License v3.0

Computational requirements for MetaScope #24

Status: Closed (esraagithub closed this issue 7 months ago)

esraagithub commented 7 months ago

I switched to an Ubuntu server with 30 GB of RAM because downloading the data in RStudio on Windows was taking a very long time. The download process had already been running on the Ubuntu server for several days for the data specified here: all_species = c("Eukaryote", "bacteria", "Viruses")

Eventually the process was killed without warning. The downloaded FASTA file had reached 150 GB. When I opened the kernel.log file, I found this message:

Jan 20 23:40:05 vmi1591041 kernel: [1043309.450125] R invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450151] CPU: 0 PID: 180410 Comm: R Not tainted 5.4.0-105-generic #119-Ubuntu
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org>
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450210] Call Trace:
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450650]  dump_stack+0x6d/0x8b
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450720]  dump_header+0x4f/0x1eb
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450722]  oom_kill_process.cold+0xb/0x10
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450856]  out_of_memory+0x1cf/0x4d0
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450974]  alloc_pages_slowpath+0xd5e/0xe50
Jan 20 23:40:05 vmi1591041 kernel: [1043309.450983]  alloc_pages_nodemask+0x2d0/0x320
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451029]  alloc_pages_current+0x87/0xe0
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451069]  page_cache_alloc+0x72/0x90
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451073]  pagecache_get_page+0xbf/0x300
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451076]  filemap_fault+0x6b2/0xa50
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451135]  ? unlock_page_memcg+0x12/0x20
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451138]  ? page_add_file_rmap+0xff/0x1a0
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451173]  ? xas_load+0xd/0x80
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451177]  ? xas_find+0x17f/0x1c0
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451180]  ? filemap_map_pages+0x24c/0x380
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451315]  ext4_filemap_fault+0x32/0x50
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451319]  do_fault+0x3c/0x130
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451321]  do_fault+0x24b/0x640
Jan 20 23:40:05 vmi1591041 kernel: [1043309.451353]  ? __switch_to_asm+0x40/0x70

Does this mean 30 GB of RAM isn't enough? If so, why did the process run for several days and the FASTA file in the output directory reach 150 GB? What are the computing requirements for running MetaScope?

I hope you can make a FASTA file, prepared for all or part of the taxa in the taxonomy_table, available for people with less computational resources.

aubreyodom commented 7 months ago

Hi,

The computational requirement is significantly lessened if you split a broad domain like Eukarya into smaller pieces, so that the FASTA files are created in smaller chunks. In your analysis, do you really need to map your reads to all eukaryotes?
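If it helps, here is a minimal sketch of that idea. It assumes MetaScope's download_refseq() takes a taxon name as its first argument and accepts reference/compress options; check ?download_refseq in your installed version for the exact arguments, and note that the taxon list below is purely illustrative:

library(MetaScope)

# Illustrative set of narrower taxa, instead of the whole "Eukaryote" superkingdom
my_taxa <- c("Fungi", "Bacteria", "Viruses")

for (taxon in my_taxa) {
  message("Downloading RefSeq genomes for: ", taxon)
  download_refseq(taxon,
                  reference = TRUE,  # reference genomes only, not every assembly
                  compress = TRUE)   # keep each FASTA gzipped to save disk space
}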

Oftentimes we will filter a human microbiome sample against all human reads (Homo sapiens).
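As a rough sketch of that host-filtering setup (download_refseq(), mk_bowtie_index(), and filter_host_bowtie() are MetaScope functions, but the file paths below are placeholders and the exact arguments may differ by version, so please check the package vignette):

library(MetaScope)

# Download only the Homo sapiens reference instead of all of Eukarya
download_refseq("Homo sapiens", reference = TRUE, compress = TRUE)

# Build a Bowtie 2 index from the downloaded FASTA
# (directories are placeholders for wherever the FASTA was written)
mk_bowtie_index(ref_dir = "refs/human",
                lib_dir = "indices",
                lib_name = "human")

# Screen the sample against the human index so host reads are removed
# before downstream alignment to the target (microbial) libraries
filter_host_bowtie(reads_bam = "sample.target.bam",
                   lib_dir = "indices",
                   libs = "human")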

> I hope you can make a FASTA file, prepared for all or part of the taxa in the taxonomy_table, available for people with less computational resources.

The computational burden is significantly lessened with the above suggestion. Our group runs the downloading step on another server and still sees long runtimes for larger genomes like Homo sapiens, but nowhere near 130 hours.

Thanks, Aubrey