wilhelm-lab / oktoberfest

Rescoring and spectral library generation pipeline for proteomics.
MIT License

Rescoring the Sage output doesn't work #211

Closed GiammaFer75 closed 4 months ago

GiammaFer75 commented 5 months ago

Describe the bug

I identified peptides using Sage. Then I tried to rescore the Sage output with Oktoberfest. Unfortunately, I got this error:

File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/spectrum_fundamentals/fragments.py", line 43, in _get_modifications
    modification_deltas[start_pos - offset] = constants.MOD_MASSES[peptide_sequence[start_pos:end_pos]]
KeyError: '[+57.0215]'
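
To see which modification tags a search result actually contains before handing it to Oktoberfest, something like the following could help. It is only a rough sketch: `count_mass_delta_tags` and `MASS_DELTA_TAG` are made-up names, not part of Oktoberfest or spectrum_fundamentals, and the example sequences are taken from the log further down.

```python
import re
from collections import Counter

# Matches bracketed mass-delta tags such as "[+57.0215]" or "[-17.0265]".
MASS_DELTA_TAG = re.compile(r"\[[+-]\d+(?:\.\d+)?\]")


def count_mass_delta_tags(sequences):
    """Count every bracketed mass-delta tag across the given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(MASS_DELTA_TAG.findall(seq))
    return counts


# Modified sequences as they appear in the Sage output shown in the log.
sequences = [
    "LTLHVGDGFEFMK",
    "ILALC[+57.0215]MGNHELYMR",
    "LILDVFC[+57.0215]GSQMHFVR",
]
tag_counts = count_mass_delta_tags(sequences)
print(tag_counts)
```

Any tag listed here that the downstream lookup table does not know verbatim would trigger the same `KeyError`.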

To Reproduce

Steps to reproduce the behavior:

  1. With Sage, I performed the peptide identification with this JSON file:
    
    {
    "database": {
        "bucket_size": 8192,
        "enzyme": {
            "missed_cleavages": 2,
            "min_len": 7,
            "max_len": 50,
            "cleave_at": "KR",
            "restrict": "P"
        },
        "fragment_min_mz": 100.0,
        "fragment_max_mz": 2000.0,
        "peptide_min_mass": 500.0,
        "peptide_max_mass": 5000.0,
        "ion_kinds": [
            "b",
            "y"
        ],
        "min_ion_index": 2,
        "max_variable_mods": 3,
        "static_mods": {
            "C": 57.0215
        },
        "variable_mods": {
            "M": [15.994]
        },
        "decoy_tag": "rev_",
        "generate_decoys": true,
        "fasta": "/HDD/okt_frac/HUMAN_swiss_072022.fasta"
    },
    "quant": {
        "lfq": true,
        "lfq_settings": {
            "peak_scoring": "Hybrid",
            "integration": "Sum",
            "spectral_angle": 0.6,
            "ppm_tolerance": 5.0
        }
    },
    "precursor_tol": {
        "ppm": [
            -20.0,
            20.0
        ]
    },
    "fragment_tol": {
        "ppm": [
            -20.0,
            20.0
        ]
    },
    "isotope_errors": [
        0,
        2
    ],
    "deisotope": true,
    "chimera": false,
    "wide_window": false,
    "predict_rt": true,
    "min_peaks": 15,
    "max_peaks": 150,
    "max_fragment_charge": 1,
    "min_matched_peaks": 4,
    "report_psms": 1,
    "output_directory": "/HDD/okt_frac/sage_out/",
    "mzml_paths": [
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr01_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr02_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr03_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr04_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr05_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr06_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr07_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr08_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr09_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr10_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr11_uncalibrated.mzML",
        "/HDD/okt_frac/20072022_RZC_STP0019_liverproject_sample061_fr12_uncalibrated.mzML"
    ]
    }
  2. With Oktoberfest, I tried to rescore with this JSON file:

    {
        "type": "Rescoring",
        "tag": "",
        "output": "/HDD/sage_test2/oktresc_sage/",
        "inputs": {
            "search_results": "/HDD/sage_test2/sage_out/results.sage.tsv",
            "search_results_type": "Sage",
            "spectra": "/HDD/sage_test2",
            "spectra_type": "raw"
        },
        "models": {
            "intensity": "Prosit_2020_intensity_HCD",
            "irt": "Prosit_2019_irt"
        },
        "prediction_server": "koina.wilhelmlab.org:443",
        "numThreads": 100,
        "fdr_estimation_method": "mokapot",
        "allFeatures": false,
        "regressionMethod": "spline",
        "ssl": true,
        "thermoExe": "/SSD/ThermoRawFileParser1.4.3/ThermoRawFileParser.exe",
        "massTolerance": 20,
        "unitMassTolerance": "ppm",
        "ce_alignment_options": {
            "ce_range": [19, 50],
            "use_ransac_model": false
        }
    }


**System:**

-   OS: Ubuntu 22.04
-   Language Version: Python 3.10.14
-   Virtual environment: Mamba

**Additional context**


The full error is this:

    (oktoberfest) thalassinoslab@scylla:/HDD/sage_test2$ oktoberfest --config_path Rescoring_sage.json
    2024-04-26 17:19:20,203 - INFO - oktoberfest.utils.config::read Reading configuration from Rescoring_sage.json
    2024-04-26 17:19:20,203 - INFO - oktoberfest.runner::run_job Oktoberfest version 0.6.2 Copyright 2024, Wilhelmlab at Technical University of Munich
    2024-04-26 17:19:20,204 - INFO - oktoberfest.runner::run_job Job executed with the following config:
    2024-04-26 17:19:20,204 - INFO - oktoberfest.runner::run_job { "type": "Rescoring", "tag": "", "output": "/HDD/sage_test2/oktresc_sage/", "inputs": { "search_results": "/HDD/sage_test2/sage_out/results.sage.tsv", "search_results_type": "Sage", "spectra": "/HDD/sage_test2", "spectra_type": "raw" }, "models": { "intensity": "Prosit_2020_intensity_HCD", "irt": "Prosit_2019_irt" }, "prediction_server": "koina.wilhelmlab.org:443", "numThreads": 100, "fdr_estimation_method": "mokapot", "allFeatures": false, "regressionMethod": "spline", "ssl": true, "thermoExe": "/SSD/ThermoRawFileParser1.4.3/ThermoRawFileParser.exe", "massTolerance": 20, "unitMassTolerance": "ppm", "ce_alignment_options": { "ce_range": [ 19, 50 ], "use_ransac_model": false } }
    2024-04-26 17:19:20,204 - INFO - oktoberfest.utils.config::read Reading configuration from Rescoring_sage.json
    2024-04-26 17:19:20,204 - INFO - oktoberfest.preprocessing.preprocessing::list_spectra Found 1 raw file in the spectra input directory.
    2024-04-26 17:19:20,204 - INFO - oktoberfest.runner::_preprocess Converting search results from /HDD/sage_test2/sage_out/results.sage.tsv to internal search result.
    2024-04-26 17:19:20,204 - INFO - spectrum_io.search_result.sage::read_result Reading msms.tsv file
    2024-04-26 17:19:20,266 - INFO - spectrum_io.search_result.sage::read_result Finished reading msms.tsv file
           MODIFIED_SEQUENCE  PROTEINS  ...  MASS  SCORE
    0      LTLHVGDGFEFMK  sp|P19623|SPEE_HUMAN  ...  1492.73840  50.368188
    1      ILALC[+57.0215]MGNHELYMR  sp|P26038|MOES_HUMAN;sp|P35241|RADI_HUMAN  ...  1719.82600  53.612749
    2      LILDVFC[+57.0215]GSQMHFVR  sp|P11413|G6PD_HUMAN  ...  1820.90660  50.431553
    3      LLQALAQYQNHLQEQPR  sp|Q9BQG0|MBB1A_HUMAN  ...  2049.07570  58.124616
    4      YLDEDTIYHLQPSGR  sp|P31153|METK2_HUMAN  ...  1805.85840  58.327720
    ...    ...  ...  ...  ...  ...
    14805  KLRTLDHSLQK  rev_sp|O15013|ARHGA_HUMAN  ...  1337.77800  22.389476
    14806  TEEEEEEEEEEEEDDEEEEGDDEGQK  sp|Q9UKV3|ACINU_HUMAN  ...  3143.08620  38.847976
    14807  RDSKSEDK  sp|Q70CQ4|UBP31_HUMAN  ...  963.46216  23.565247
    14808  LAADEDDDDDDEEDDDEDDDDDDFDDEEAEEKAPVKK  sp|P06748|NPM_HUMAN  ...  4245.54150  81.272737
    14809  LAADEDDDDDDEEDDDEDDDDDDFDDEEAEEKAPVK  sp|P06748|NPM_HUMAN  ...  4117.44700  90.991145

    [14810 rows x 12 columns]
    Index(['MODIFIED_SEQUENCE', 'PROTEINS', 'RAW_FILE', 'SCAN_NUMBER', 'CALCMASS', 'PRECURSOR_CHARGE', 'HYPERSCORE', 'REVERSE', 'SEQUENCE', 'PEPTIDE_LENGTH', 'MASS', 'SCORE'], dtype='object')
    2024-04-26 17:19:20,338 - INFO - spectrum_io.search_result.search_results::filter_valid_prosit_sequences #sequences before filtering for valid prosit sequences: 14810
    2024-04-26 17:19:20,357 - INFO - spectrum_io.search_result.search_results::filter_valid_prosit_sequences #sequences after filtering for valid prosit sequences: 14779
    2024-04-26 17:19:20,467 - INFO - oktoberfest.runner::_preprocess Read 14779 PSMs from /HDD/sage_test2/oktresc_sage/msms/msms.prosit
    2024-04-26 17:19:20,487 - INFO - oktoberfest.preprocessing.preprocessing::split_search Creating split search results file /HDD/sage_test2/oktresc_sage/msms/15032024_RZC_uPAC_DIRECT_CV55_HelaQCPromega_30min_100ng_03.rescore
    Waiting for tasks to complete: 0%| | 0/1 [00:00<?, ?it/s]
    2024-04-26 17:19:21,005 - INFO - spectrum_io.raw.thermo_raw::convert_raw_mzml Converting thermo rawfile to mzml with the command: mono /SSD/ThermoRawFileParser1.4.3/ThermoRawFileParser.exe --msLevel=2 -i /HDD/sage_test2/15032024_RZC_uPAC_DIRECT_CV55_HelaQCPromega_30min_100ng_03.raw -b /HDD/sage_test2/oktresc_sage/spectra/15032024_RZC_uPAC_DIRECT_CV55_HelaQCPromega_30min_100ng_03.mzML
    2024-04-26 17:19:21 INFO Started parsing /HDD/sage_test2/15032024_RZC_uPAC_DIRECT_CV55_HelaQCPromega_30min_100ng_03.raw
    2024-04-26 17:19:24 INFO Processing 23682 MS scans
    10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

    2024-04-26 17:20:09 INFO Finished parsing /HDD/sage_test2/15032024_RZC_uPAC_DIRECT_CV55_HelaQCPromega_30min_100ng_03.raw
    2024-04-26 17:20:09 INFO Processing completed 0 errors, 0 warnings
    2024-04-26 17:20:09,664 - INFO - spectrum_io.raw.msraw::_read_mzml_pyteomics Reading mzML file: /HDD/sage_test2/oktresc_sage/spectra/15032024_RZC_uPAC_DIRECT_CV55_HelaQCPromega_30min_100ng_03.mzML
    /home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/pyteomics/xml.py:650: ResourceWarning: unclosed file <_io.BufferedReader name='/HDD/sage_test2/oktresc_sage/spectra/15032024_RZC_uPAC_DIRECT_CV55_HelaQCPromega_30min_100ng_03.mzML'>
      for event, elem in etree.iterparse(
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    2024-04-26 17:20:20,187 - INFO - oktoberfest.preprocessing.preprocessing::merge_spectra_and_peptides Merging rawfile and search result
    2024-04-26 17:20:20,198 - INFO - oktoberfest.preprocessing.preprocessing::merge_spectra_and_peptides There are 14779 matched identifications
    2024-04-26 17:20:20,200 - INFO - oktoberfest.preprocessing.preprocessing::annotate_spectral_library Annotating spectra...
    Waiting for tasks to complete: 0%| | 0/1 [00:59<?, ?it/s]
    2024-04-26 17:20:20,213 - ERROR - oktoberfest.utils.multiprocessing_pool::check_pool Caught Unknown exception, terminating workers
    2024-04-26 17:20:20,216 - ERROR - oktoberfest.utils.multiprocessing_pool::check_pool multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/oktoberfest/runner.py", line 474, in _calculate_features
        library = _ce_calib(spectra_file, config)
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/oktoberfest/runner.py", line 433, in _ce_calib
        library = _annotate_and_get_library(spectra_file, config, tims_meta_file=tims_meta_file)
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/oktoberfest/runner.py", line 129, in _annotate_and_get_library
        pp.annotate_spectral_library(library, mass_tol=config.mass_tolerance, unit_mass_tol=config.unit_mass_tolerance)
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/oktoberfest/preprocessing/preprocessing.py", line 528, in annotate_spectral_library
        df_annotated_spectra = annotate_spectra(psms.spectra_data, mass_tol, unit_mass_tol)
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/spectrum_fundamentals/annotation/annotation.py", line 146, in annotate_spectra
        results = parallel_annotate(row, index_columns, mass_tolerance, unit_mass_tolerance)
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/spectrum_fundamentals/annotation/annotation.py", line 358, in parallel_annotate
        return _annotate_linear_spectrum(spectrum, index_columns, mass_tolerance, unit_mass_tolerance)
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/spectrum_fundamentals/annotation/annotation.py", line 385, in _annotate_linear_spectrum
        fragments_meta_data, tmt_n_term, unmod_sequence, calc_mass = initialize_peaks(
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/spectrum_fundamentals/fragments.py", line 119, in initialize_peaks
        modification_deltas = _get_modifications(sequence)
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/spectrum_fundamentals/fragments.py", line 43, in _get_modifications
        modification_deltas[start_pos - offset] = constants.MOD_MASSES[peptide_sequence[start_pos:end_pos]]
    KeyError: '[+57.0215]'
    """

The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/site-packages/oktoberfest/utils/multiprocessing_pool.py", line 43, in check_pool
        outputs.append(res.get(timeout=10000))  # 10000 seconds = ~3 hours
      File "/home/thalassinoslab/miniforge3/envs/oktoberfest/lib/python3.10/multiprocessing/pool.py", line 774, in get
        raise self._value
    KeyError: '[+57.0215]'

    2024-04-26 17:20:20,216 - ERROR - oktoberfest.utils.multiprocessing_pool::check_pool '[+57.0215]'
    WARNING:root:WARNING: Temp mmap arrays were written to /tmp/temp_mmap_g45c55w5. Cleanup of this folder is OS dependant, and might need to be triggered manually! Current space: 812,169,949,184

picciama commented 4 months ago

First, sorry for my late reply :( The problem with Sage is that it reports modifications as mass deltas, which we need to map to UNIMOD. If a mass delta differs by even a single digit from the value we expect, it cannot be mapped. This is not a nice way of doing it, and we want to improve it by letting the user provide a dictionary mapping mass deltas to UNIMOD in the config.
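
To illustrate why an exact string lookup is fragile, here is a minimal sketch of a tolerance-based mapping instead. It is not how spectrum_fundamentals does it: `KNOWN_DELTAS`, `TOLERANCE`, and `to_unimod` are hypothetical names, the tolerance is an arbitrary choice, and the `[UNIMOD:n]` output notation is an assumption about the target format.

```python
import re

# Hypothetical lookup table: monoisotopic mass delta -> UNIMOD tag.
KNOWN_DELTAS = {
    57.021464: "[UNIMOD:4]",   # carbamidomethylation
    15.994915: "[UNIMOD:35]",  # oxidation
}
TOLERANCE = 0.002  # arbitrary illustrative tolerance in Da

# Captures the signed number inside tags such as "[+57.0215]".
TAG = re.compile(r"\[([+-]\d+(?:\.\d+)?)\]")


def to_unimod(sequence, known=KNOWN_DELTAS, tol=TOLERANCE):
    """Replace each mass-delta tag with the nearest known UNIMOD tag."""

    def replace(match):
        delta = float(match.group(1))
        # Pick the known mass closest to the reported delta.
        nearest = min(known, key=lambda mass: abs(mass - delta))
        if abs(nearest - delta) > tol:
            raise ValueError(f"no known modification within {tol} Da of {delta}")
        return known[nearest]

    return TAG.sub(replace, sequence)


print(to_unimod("ILALC[+57.0215]MGNHELYMR"))
```

With a numeric comparison, `[+57.0215]` and `[+57.021464]` resolve to the same entry, whereas a dictionary keyed on the literal string accepts only one spelling.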

I will publish a hotfix of spectrum_fundamentals asap, in which the value for carbamidomethylation of C is changed to the exact value documented in Sage, [+57.0215]. I will keep you posted.

picciama commented 4 months ago

I released a new version of spectrum_fundamentals that fixes this issue. Please update the dependency, for example by running `pip install --upgrade spectrum-fundamentals`.
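
To confirm which version is active in the environment after upgrading, a quick check along these lines should work. The helper name `installed_version` is made up for this example; `importlib.metadata` is part of the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this inside the same environment that Oktoberfest uses.
print(installed_version("spectrum-fundamentals"))
```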