wilhelm-lab / spectrum_fundamentals

Fundamentals public repo
MIT License
8 stars 2 forks source link

Cryptic error message when encountering unknown modifications #9

Closed MatthewThe closed 5 months ago

MatthewThe commented 1 year ago

Describe the bug

_get_modifications() in fragments.py returns None when it encounters a sequence with an unknown modification:

https://github.com/wilhelm-lab/spec_fundamentals/blob/3407711ab5b2d46f3c5702433bb869d251219340/fundamentals/fragments.py#L30-L34

It does raise a logger.info message, but the upstream functions do not handle this behavior, e.g. the function initialize_peaks() in fragments.py errors out without a useful error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/root/oktoberfest/re_score.py", line 29, in calculate_features_single
    features.predict_with_aligned_ce(df_search)
  File "/root/oktoberfest/calculate_features.py", line 24, in predict_with_aligned_ce
    self.perform_alignment(df_search)
  File "/root/oktoberfest/ce_calibration.py", line 170, in perform_alignment
    self.gen_lib(df_search)
  File "/root/oktoberfest/ce_calibration.py", line 98, in gen_lib
    df_annotated_spectra = annotate_spectra(df_join)
  File "/usr/local/lib/python3.8/site-packages/fundamentals/annotation/annotation.py", line 112, in annotate_spectra
    results = parallel_annotate(row, index_columns)
  File "/usr/local/lib/python3.8/site-packages/fundamentals/annotation/annotation.py", line 182, in parallel_annotate
    fragments_meta_data, tmt_n_term, unmod_sequence, calc_mass = initialize_peaks(spectrum[index_columns['MODIFIED_SEQUENCE']],
  File "/usr/local/lib/python3.8/site-packages/fundamentals/fragments.py", line 112, in initialize_peaks
    modification_deltas, tmt_n_term, peptide_sequence = _get_modifications(peptide_sequence)
TypeError: cannot unpack non-iterable NoneType object

To Reproduce

Steps to reproduce the behavior:

  1. Pass a sequence with an unknown modification in the MODIFIED_SEQUENCE column, e.g. AALDSIN(as)ELPENIL

Expected behavior

We can either skip such sequences in the parent functions or raise an exception inside _get_modifications() that the parent function can choose to deal with or not.

System [please complete the following information]:

picciama commented 1 year ago

The behaviour has changed since you opened this issue. There is an AssertionError raised now after the return value of _get_modifications has been checked. https://github.com/wilhelm-lab/spectrum_fundamentals/blob/4d587fd4831f7079a3d37fa756fa1d4431f1318c/spectrum_fundamentals/fragments.py#L111-L127 The reason you got this TypeError is because the function returns None in case of an unknown modification. In order to fix this everywhere, I propose moving the check into the function itself instead of returning None.

Skipping in the parent functions is a question of whether you want the program to exit entirely or process the remainder of the sequences. We can handle this by wrapping the required parent functions into try except blocks and handle this accordingly in the except/finally block while raising the exception in _get_modifications. This would be the proper way of handling this. We can still keep the logger output, but it should definitely be an error level and not just info, as it is right now.