zqfang / GSEApy

Gene Set Enrichment Analysis in Python
http://gseapy.rtfd.io/
BSD 3-Clause "New" or "Revised" License

Memory leak #142

Closed Floreuzan closed 2 years ago

Floreuzan commented 2 years ago

Setup

I am reporting a problem with GSEApy version, Python version, and operating system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import gseapy; print(gseapy.__version__)

3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CPython
Windows-10-10.0.19041-SP0
0.10.5

Expected behaviour

I want to run the gp.prerank() function.

        import gseapy as gp

        pre_res = gp.prerank(rnk=rnk_file, gene_sets=geneset_file,
                    processes=4,
                    permutation_num=100,  # reduce number to speed up testing
                    outdir='rnk_output/' + rnk_file + '_' + geneset_file,
                    format='png',
                    seed=6,
                    no_plot=True)

Actual behaviour

I chose the C2 collection from the MSigDB website, which has approximately 6300 gene sets. Even though I call the function on a system with ~80 GB of RAM plus swap space, there appears to be a memory leak: using swap space does not slow the calculation down, so the process is not going back to memory it has used previously.
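
One way to check whether memory is actually retained across repeated calls is the standard-library tracemalloc module. The following is only a minimal sketch: `workload` and `retained_bytes` are hypothetical names introduced for illustration, and `workload` is a stand-in for the gp.prerank() call, not GSEApy code. Note that tracemalloc only sees allocations made in the current Python process, so memory used by separate worker processes would not show up here.

```python
import tracemalloc


def workload(n):
    # Hypothetical stand-in for a gp.prerank() call: builds and discards a list.
    data = [i * i for i in range(n)]
    return sum(data)


def retained_bytes(func, *args, repeats=5):
    """Return the traced bytes still allocated after each repeated call to func."""
    tracemalloc.start()
    retained = []
    try:
        for _ in range(repeats):
            func(*args)
            current, _peak = tracemalloc.get_traced_memory()
            retained.append(current)
    finally:
        tracemalloc.stop()
    return retained


if __name__ == "__main__":
    for i, size in enumerate(retained_bytes(workload, 100_000), start=1):
        print(f"after call {i}: {size} bytes still allocated")
```

If the retained figure keeps climbing from call to call, something is holding on to memory between calls; if it stays roughly flat, the growth is more likely temporary working memory.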

Attempted fix

To solve this issue, I modified the file GSEApy/gseapy/algorithm.py. The function gsea_compute() calls Parallel() from the joblib package; there you can remove the require='sharedmem' option (line 509).

In other words,

        temp_esnu = Parallel(n_jobs=processes, require='sharedmem')(delayed(enrichment_score)( 
                        gl, cor_vec, gmt.get(subset), w, n, 
                        rs, single, scale) 
                        for subset, rs in zip(subsets, random_seeds)) 

becomes:

        temp_esnu = Parallel(n_jobs=processes)(delayed(enrichment_score)( 
                        gl, cor_vec, gmt.get(subset), w, n, 
                        rs, single, scale) 
                        for subset, rs in zip(subsets, random_seeds)) 
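
For readers unfamiliar with the option, here is a small self-contained sketch of the same change outside GSEApy. With require='sharedmem', joblib is constrained to a thread-based backend; without it, joblib is free to use its default process-based backend. `score` and `run` are hypothetical names for illustration, and `score` is only a stand-in for enrichment_score(). The sketch shows that both variants return the same results; it does not measure memory.

```python
from joblib import Parallel, delayed


def score(values, seed):
    # Hypothetical stand-in for enrichment_score(); not GSEApy code.
    return sum(values) + seed


values = list(range(1000))
seeds = [1, 2, 3, 4]


def run(require=None):
    # require='sharedmem' forces a thread-based backend;
    # require=None lets joblib pick its default backend.
    return Parallel(n_jobs=2, require=require)(
        delayed(score)(values, s) for s in seeds
    )


if __name__ == "__main__":
    print(run(require="sharedmem"))
    print(run(require=None))
```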

zqfang commented 2 years ago

Thanks @Floreuzan. I would also like cleaner code that uses less memory, but it seems to require a bit of effort to refactor the code.

zqfang commented 2 years ago

An upcoming release of GSEApy, rewritten in Rust, will fix the problem here! Stay tuned.

zqfang commented 2 years ago

The Rust binding of GSEApy (v0.11.0) has been released. Closing the issue now.