dmalzl opened this issue 2 months ago (status: Open)
How large is your gene_set (how many sets are in the GMT file)?
You can iterate through the gene_set (one set per run); this may reduce the memory. I will see if I can reduce the memory usage in the Rust backend.
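For illustration, a minimal sketch of the per-chunk idea (assuming the GMT is parsed into a dict, that gseapy.gsva accepts a dict of gene sets, and that the result exposes a res2d DataFrame; exact argument names may differ in your gseapy version):

```python
import gseapy as gp
import pandas as pd

def read_gmt(path):
    """Parse a GMT file into {set_name: [genes]} (minimal parser)."""
    gene_sets = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            gene_sets[fields[0]] = fields[2:]  # fields[1] is the description column
    return gene_sets

def gsva_in_chunks(expr, gmt_path, chunk_size=500, threads=32):
    """Run GSVA on chunks of gene sets and concatenate the results.

    expr: genes x samples DataFrame (the orientation gseapy expects; adjust if needed).
    chunk_size / threads are tunable; smaller chunks should cap the peak working set.
    """
    gene_sets = read_gmt(gmt_path)
    names = list(gene_sets)
    results = []
    for start in range(0, len(names), chunk_size):
        chunk = {n: gene_sets[n] for n in names[start:start + chunk_size]}
        res = gp.gsva(data=expr, gene_sets=chunk, threads=threads, outdir=None)
        results.append(res.res2d)
    return pd.concat(results, ignore_index=True)
```

With 7300 gene sets and chunk_size=500 this would mean about 15 runs, trading some per-chunk startup overhead for a bounded peak memory footprint.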
There are 7300 gene sets in my GMT file. However, I fear running them separately will increase computation time drastically, as running them in one go already takes 17 h.
Thanks for looking into it. As I said, it is not really a problem for me; I just noticed it.
As a fellow Python user who does not want to switch to R too often, I love the gseapy package. Recently, I discovered that it also offers a GSVA implementation. To see if this interpretation helps with what I am trying to achieve with my data, I decided to give it a try and ran it using 32 cores.

The data at hand is 13k samples x 36k genes, which is quite large but easily fits in 32 GB of RAM (depending on the representation). However, I had quite some issues with the memory consumption of the implementation: it stays at around 30 GB for most of the run but blows up to roughly 220 GB at the end.

I did not dig into the code yet, but I feel like this is a bit out of hand. From experience with Python/R interfaces I can only guess that this may come from shuttling data between Rust and Python, or that the data is converted from sparse to dense at some point in the algorithm. Another option is that all 32 threads receive a copy of the same data. In any case, although I am lucky our system can handle such a large amount of RAM usage, I feel there is an opportunity to make the algorithm more memory efficient so that others can also use it on large datasets.
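For scale, a back-of-the-envelope check of the per-thread-copy hypothesis (assuming a dense float64 matrix; purely illustrative, not a measurement of what the backend actually does):

```python
# Rough memory estimate for the per-thread-copy hypothesis (illustrative only).
n_samples, n_genes, n_threads = 13_000, 36_000, 32
bytes_per_value = 8  # float64

dense_gib = n_samples * n_genes * bytes_per_value / 2**30
print(f"one dense copy:     {dense_gib:.1f} GiB")              # ~3.5 GiB
print(f"{n_threads} copies: {n_threads * dense_gib:.1f} GiB")  # ~111.6 GiB
```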
Any thoughts on why this is and how to improve on it?