Provide tokenization function

bittremieux commented 2 years ago

Have a separate tokenization function to unify this behavior across depthcharge (i.e. to calculate masses from peptide sequences) and Casanovo (during evaluation of the predictions).

bittremieux commented 2 years ago

One more related thing is that Casanovo sometimes apparently predicts a double N-terminal acetylation, e.g. +43.006+43.006PEPTIDEK, in which case the +43.006+43.006 is not found in the AA mass dictionary. How should we handle such weird predictions that Casanovo comes up with? Currently mass calculation crashes with a KeyError, which doesn't seem desired behavior.

wfondrie commented 2 years ago

@bittremieux, this also brings up a larger question in my mind: should we try and parse peptides sequences and their modifications into the same format, say the mass enclosed in []? Such standardization could help prevent these edge cases and I think I could change the parser to specifically handale [], (), and non-enclosed modifications.

bittremieux commented 2 years ago

Good point. General parsing functionality could be useful to harmonize different ways people encode peptide sequences. Now we've kinda been using the format that MassIVE uses, but there are various different flavors. A parsing function that translates sequences to a consistent internal representation would enable depthcharge/Casanovo to deal with different input formats. (The dev version of) spectrum_utils also supports the full ProForma specification to encode peptidoforms, including controlled vocabularies to encode modifications.

On the other hand, I think there will always be edge cases. Is it worth the effort and likely continuing trouble to have wide support, versus just telling people what they should use? This is only a tangential part of the functionality.

wfondrie commented 2 years ago

Good points. I think you're right that we shouldn't worry too much about parsing. However, I am inclined to let users specify a custom regex if they have a different style of modification.

One more related thing is that Casanovo sometimes apparently predicts a double N-terminal acetylation, e.g. +43.006+43.006PEPTIDEK, in which case the +43.006+43.006 is not found in the AA mass dictionary. How should we handle such weird predictions that Casanovo comes up with? Currently mass calculation crashes with a KeyError, which doesn't seem desired behavior.

This would be easy to account for, except that carbamylation with an ammonia neutral loss "+43.006-17.027" is apparently a n-terminal modification that I found when looking at MassIVE-KB. Is that a mistake on my part?

More generally, it may be better to create an new token for every terminal modification + residue.

Either way, ideally then we would restrict Casanovo to only predict the non-terminal tokens during inference.

Thoughts?

bittremieux commented 2 years ago

This would be easy to account for, except that carbamylation with an ammonia neutral loss "+43.006-17.027" is apparently a n-terminal modification that I found when looking at MassIVE-KB. Is that a mistake on my part?

No, I've seen it in the data as well and I never really worried about it because I figured it must be fine. Does it not make sense chemically?

More generally, it may be better to create an new token for every terminal modification + residue.

You mean have +43.006A, +43.006C, +43.006D, ... ? That's going to expand our alphabet massively, and there likely won't be that many examples for each of those. Although you indeed see only a single peak for the combination of the N-terminal modification and the first amino acid, rather than separate peaks for each, I think practically it's not the best solution.

Either way, ideally then we would restrict Casanovo to only predict the non-terminal tokens during inference.

Indeed, I'd been thinking about that as well. It's as easy as masking out the indexes that correspond to N-terminal modifications after the first token has been predicted. But we should maybe think about how to represent that in the model. We can hardcode those indexes based on our current AA alphabet, but ideally the model should be able to do this dynamically. Some kind of flag in the AA alphabet config to indicate whether certain tokens are restricted to C- or N-term?

bittremieux commented 1 year ago

Fixed in #34.

wfondrie / depthcharge

Provide tokenization function #18