wfondrie / depthcharge

A deep learning toolkit for mass spectrometry
https://wfondrie.github.io/depthcharge/
Apache License 2.0
59 stars 18 forks source link

Scan ids #13

Closed BercesteDincer closed 1 year ago

BercesteDincer commented 2 years ago

Hi @wfondrie,

I have modified the newest version to add the scan ids. I made all the previous changes you suggested. I also enabled shuffling the dataloader with an additional parameter.

I had to made minor modifications to the tests since now the batch contains one more element. It passes all the tests and I also formatted to code with black.

Hope the changes make sense, happy to make extra edits/fixes!

wfondrie commented 2 years ago

I've made a few changes to this PR:

Note that we need more unit tests to cover more than just MGF files, but I don't have the bandwidth for that just yet.

This PR should resolve Noble-Lab/casanovo#70

codecov-commenter commented 2 years ago

Codecov Report

Merging #13 (d575b09) into main (945d6b6) will increase coverage by 82.33%. The diff coverage is 74.32%.

@@            Coverage Diff            @@
##           main      #13       +/-   ##
=========================================
+ Coverage      0   82.33%   +82.33%     
=========================================
  Files         0       19       +19     
  Lines         0      832      +832     
=========================================
+ Hits          0      685      +685     
- Misses        0      147      +147     
Impacted Files Coverage Δ
depthcharge/data/__init__.py 100.00% <ø> (ø)
depthcharge/models/__init__.py 100.00% <ø> (ø)
depthcharge/models/denovo/model.py 89.42% <0.00%> (ø)
depthcharge/data/parsers.py 63.80% <54.54%> (ø)
depthcharge/data/loaders.py 91.48% <66.66%> (ø)
depthcharge/data/hdf5.py 90.50% <86.48%> (ø)
depthcharge/data/datasets.py 97.40% <87.50%> (ø)
depthcharge/components/feedforward.py 22.72% <0.00%> (ø)
depthcharge/version.py 53.84% <0.00%> (ø)
... and 16 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

wfondrie commented 1 year ago

@bittremieux, a quick question to ponder as I make revisions:

thought: Note that scan numbers from Thermo files typically have a prefix that's not included in a USI. I.e. a scan id might look like this: controllerType=0 controllerNumber=1 scan=1 (controller distinguishes between Thermo sequencing instruments and mass specs, but in our case will always be 0 and 1 fyi). This is considered a "nativeID" by the USI specification. To follow the spec, we can convert to scan numbers by removing the prefix if it's present.

In light of this, I think all of the identifiers could be represented as integers instead of strings. I think I should modify the code to store the integer and create the correct string when asked for a PSI-compatible ID. What do you think?

bittremieux commented 1 year ago

Yes, I think that's a relevant change. Both scan numbers and index numbers are indeed integers, so storing them as such would be much more memory-efficient. There will need to be a distinction between scan numbers (from mzML and mzXML) and index numbers (from MGF) though to be able to properly construct the ID string afterwards though.