wilhelm-lab / oktoberfest

Rescoring and spectral library generation pipeline for proteomics.
MIT License

Question regarding the numThreads parameter #303

Open mhamaneh opened 1 week ago

mhamaneh commented 1 week ago

Hi

I have a question regarding the numThreads parameter: the documentation states that this parameter "needs to be balanced with batchsize". What does this mean? Also, is there a way to run Oktoberfest/Prosit on a local machine without using the server?

Thanks

picciama commented 5 days ago

Hi @mhamaneh,

The numThreads parameter and batchSize

- numThreads: affects overall memory consumption
- batchSize: affects how large the memory peaks are when a batch is deleted and refilled

Maximum memory consumption = batchSize * numThreads.
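As a back-of-envelope check, the formula above can be turned into a quick estimate. The per-peptide footprint below is a made-up placeholder, not a measured value; the real footprint depends on the model and the number of predicted ions:

```python
# Hypothetical per-peptide prediction footprint (NOT a measured value).
BYTES_PER_PEPTIDE = 50_000  # assume ~50 kB of predictions per peptide

def peak_batch_memory_gb(batch_size: int, num_threads: int) -> float:
    """Upper bound on memory held by in-flight batches, in GB."""
    return batch_size * num_threads * BYTES_PER_PEPTIDE / 1e9

# Example: batches of 10,000 peptides with 5 prediction threads.
print(peak_batch_memory_gb(10_000, 5))  # 2.5 (GB) under the assumed footprint
```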

The preset configuration should be fine in almost all scenarios on a personal computer. If you run Oktoberfest on a large server machine, you can increase the batchSize or the number of threads, but don't use more than 10 threads at any time; we found higher thread counts to be problematic for the servers. Use a larger batchSize instead.

Explanation

Imagine you only had one thread and an infinite batch size. In this scenario, all the predictions would be retrieved, stored in memory, and then written to the file. This would likely require far more memory than you have available on your machine, since spectral libraries can easily exceed 100 GB in size depending on what you do. It would also be a waste of time, since while you retrieve predictions, you could already start writing them to disk.

This is why Oktoberfest does this in parallel: one thread is always waiting for predictions to write, while numThreads threads retrieve predictions with batchSize peptides each.

But you could still predict faster than you write, in which case your memory would blow up over time. At the same time, you want to finish as fast as possible. The strategy Oktoberfest uses is a fixed number of memory slots for batches, which is exactly numThreads. So the most memory you can use at any time is batchSize * numThreads.
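A minimal sketch of this slot strategy (hypothetical names, not Oktoberfest's actual implementation): a bounded queue with numThreads slots makes predictor threads block once all slots are full, while a single writer frees slots as it drains them.

```python
import queue
import threading

NUM_THREADS = 3   # prediction threads == number of memory slots
BATCH_SIZE = 4    # peptides per batch (tiny for illustration)
NUM_BATCHES = 6

# Bounded queue: at most NUM_THREADS batches can sit in memory at once,
# so peak memory is capped at BATCH_SIZE * NUM_THREADS peptides.
slots = queue.Queue(maxsize=NUM_THREADS)

batches = queue.Queue()
for i in range(NUM_BATCHES):
    batches.put(i)

written = []

def predictor():
    while True:
        try:
            batch_id = batches.get_nowait()
        except queue.Empty:
            return
        # Stand-in for retrieving predictions for BATCH_SIZE peptides.
        predictions = [f"peptide_{batch_id}_{j}" for j in range(BATCH_SIZE)]
        slots.put(predictions)  # blocks while all slots are occupied

def writer():
    for _ in range(NUM_BATCHES):
        predictions = slots.get()    # frees a slot once drained
        written.extend(predictions)  # stand-in for writing to disk

threads = [threading.Thread(target=predictor) for _ in range(NUM_THREADS)]
w = threading.Thread(target=writer)
for t in threads:
    t.start()
w.start()
for t in threads:
    t.join()
w.join()
print(len(written))  # 24 = NUM_BATCHES * BATCH_SIZE
```

The key design choice is `maxsize=NUM_THREADS` on the queue: `put` blocks when all slots are full, which is exactly what caps memory at batchSize * numThreads.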

This figure illustrates how it works: [image]

How to optimize the two parameters

What you want is to keep the writer process busy, i.e. there should always be data available in memory to write to disk, and you should have enough prediction threads to fill up the available memory slots. If the slots are full, the prediction threads will wait until a slot is freed by the writer process. This way, you generate the spectral library as fast as possible, restricted only by how fast your machine can write data to disk.
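One way to reason about the balance (a rough heuristic of mine, not an official tuning rule): if retrieving a batch of predictions takes longer than writing one, you need enough prediction threads that a finished batch is, on average, always waiting for the writer.

```python
import math

def suggested_num_threads(t_predict_s: float, t_write_s: float, cap: int = 10) -> int:
    """Heuristic thread count to keep the writer saturated.

    t_predict_s: avg seconds to retrieve predictions for one batch
    t_write_s:   avg seconds to write one batch to disk
    cap:         upper limit (the 10-thread recommendation above)
    """
    return min(cap, max(1, math.ceil(t_predict_s / t_write_s)))

# Example: predictions take 12 s per batch, writing takes 3 s per batch,
# so ~4 threads keep a batch ready whenever the writer finishes one.
print(suggested_num_threads(12.0, 3.0))  # 4
```

The `cap` reflects the advice above: beyond 10 threads, grow batchSize instead of the thread count.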

Local predictions

While Oktoberfest relies on Koina to receive its predictions, supporting multiple state-of-the-art peptide property prediction models, there are efforts to support local predictions. This works by designing a local model through the DLomix package, but the feature is not yet 100% ready. Once it is, people can either use Prosit-based template models or design their own models using DLomix: https://dlomix.readthedocs.io/en/main/ With regards to this, I think @WassimG and @omsh should be able to provide you with more information.

As an alternative, you can also set up your own local Koina server. We provide a docker container that should be relatively easy to set up. You can then simply use your local server's address in the config file, i.e. prediction_server=https://my.awesome.predictionserver.com. If you would like to know more about this, please ask @LLautenbacher and check the instructions: https://github.com/wilhelm-lab/koina?tab=readme-ov-file#hosting-your-own-server