wfondrie / depthcharge

A deep learning toolkit for mass spectrometry
https://wfondrie.github.io/depthcharge/
Apache License 2.0
59 stars 18 forks source link

Preprocess spectra before adding to index #31

Closed bittremieux closed 1 year ago

bittremieux commented 1 year ago

I did some coding on my flight to Japan and moved the preprocessing code from Casanovo to depthcharge to fix Casanovo#56. The upshot is that invalid spectra are now no longer included in the index on the depthcharge side, rather than later being replaced by a dummy spectrum by Casanovo that will get an incorrect prediction.

This needs some unit tests, but I'm unlikely to be able to add those within the next two weeks while I'm in Japan.

One remaining question is whether we also want to introduce a parameter that specifies the minimum number of peaks (after preprocessing) for spectra to be included?

Linked to Casanovo#181.

bittremieux commented 1 year ago

I actually had an idea to implement this better. Instead of just moving all of the preprocessing options from Casanovo to depthcharge, maybe it would be cleaner if the preprocessing happens on the depthcharge side, but it is done by giving it a callable with a specific signature (i.e. raw spectrum in, processed spectrum out, ValueError to discard the spectrum). That would make the preprocessing in depthcharge much more general, with each application that can fully define preprocessing based on its preferences. And it still solves the initial problem.

What do you think @wfondrie? The initial coding was done at the tail end of a 14h flight and I thought of this alternative while only sleeping 3 hours last night due to jet lag, so ymmv.

wfondrie commented 1 year ago

Nice! I think this is a good start. One of the things to consider is making even more flexible. Instead of having a fixed set of transformations, we could execute arbitrary functions, kind of like I setup for the Dataset classes: https://github.com/wfondrie/depthcharge/blob/25d617f5ad421a5c0dcefd19628449293b414da3/depthcharge/data/datasets.py#L23-L26

If we did do that, I would think we'd actually want to move the preprocessing_fn to the parsing side of things, rather than as something the dataset does. The only reason I made this design choice originally is so that someone could re-use the same index and try out multiple preprocessing functions, obviously with the runtime cost of preprocessing. I'm open to arguments to the latter though!

bittremieux commented 1 year ago

Indeed, that was my proposal as well. I forgot that there's a preprocessing_fn in the dataloader already. Why did we reimplement basically the same thing in Casanovo then instead of using preprocessing_fn from depthcharge?

I think that moving preprocessing_fn to the parsers is a good solution. I agree that conceptually you might not want to recreate the index every time a simple preprocessing option changes, but in practice I doubt that's going to happen very often. (We don't change it in Casanovo, and for ANN-SoLo I've noticed the same and I'm making a similar change.) Whereas this change would avoid the issue with not being able to skip invalid items in a dataloader. So the upside outweighs the potential negative aspect imo.

wfondrie commented 1 year ago

Why did we reimplement basically the same thing in Casanovo then instead of using preprocessing_fn from depthcharge?

To be honest, I don't know 🤷‍♂️ - I think originally Melih copied and pasted the dataset code from depthcharge because it needed some minor tweaking for casanovo, but I don't know if that's still the case.

I think that moving preprocessing_fn to the parsers is a good solution. I agree that conceptually you might not want to recreate the index every time a simple preprocessing option changes, but in practice I doubt that's going to happen very often. (We don't change it in Casanovo, and for ANN-SoLo I've noticed the same and I'm making a similar change.) Whereas this change would avoid the issue with not being able to skip invalid items in a dataloader. So the upside outweighs the potential negative aspect imo.

I'm fine with this change - I went back and forth debating with my original implementation anyway.

bittremieux commented 1 year ago

Superseded by #34.