wilhelm-lab / oktoberfest

Rescoring and spectral library generation pipeline for proteomics.
MIT License
29 stars 8 forks source link

Missing ProteinID in the generated Spectral Lib and timsTOF questions #238

Closed ZhaoYaniBruker closed 1 month ago

ZhaoYaniBruker commented 1 month ago

Dear developers,

I have used Oktoberfest to generate a Spectral Library from a .fasta file, however, there is no proteinID information in this Lib. I have checked the settings few times, still didn't get how to select this option. Could you please give a little hint? Thanks very much for your time.

Best regards.

picciama commented 1 month ago

This is indeed not yet published but already close to completion. I will try to release this asap (hopefully this week).

ZhaoYaniBruker commented 1 month ago

Dear Picciama,

Thank you very much for your reply. I am looking forward to see the new release.

I have another little question: I used "models": {"intensity": "Prosit_2020_intensity_HCD", "irt": "Prosit_2019_irt"}, "prediction_server": "koina.wilhelmlab.org:443" to generate the Spectral library. I am not sure if the modifications were included in the Prosit models, I mean these generally used modifications like "Ac(N-term)", "Ox(M)", and "C(carbo...)". Or, I have to manually setup these by myself.

Thanks again for your reply.

picciama commented 1 month ago

The way Oktoberfest works is that it applies static carbamidomethylation to C and variable oxidation to M. These are the only modifications that are supported by the Prosit models, which is why there is no additional configuration for this. The only other modificaitons is TMT if you specify the tag in the config and use a TMT-Prosit model. This is because of how Prosit models were trained. N-term acetylated peptides are overall not so common, which led to the decision not to support it back in the days, as the number of identified proteins didn't increase because of that. Right now, if you have peptides with n-actylation or any other modification, they are filtered out. You should see the number of filtered peptides in the log file, where it reads "Number of peptides before / after filtering". We will start to support other models, such as AlphaPept, DeepLC, and MS2PIP starting with the next release (0.7.0) which may support additional modifications, and will add the possibility to customize the modifications in the config accordingly. In addition, the new Prosit-PTM model is worked on, which is theoretically supporting any modification one could think of.

ZhaoYaniBruker commented 1 month ago

Dear Picciama,

Congratulations on the new release :D. I will try it out soon.

1) Actually, I find another problem when I analyze the generated spectral library from the old version. I have set the peptide length ranges from 7 to 50, but I only got peptides with lengths up to 30. Please find the attached config. I wonder if this is due to the collisionEnergy setting, or it is a bug? Screenshot 2024-07-16 133847

2) I am also interested in the rescoring, and I don't have much knowledge about it. It looks like the Oktoberfest rescore can only work on DDA data (am I right?), are there any software tools or any ways to rescore DIA data too?

With best regards.

picciama commented 1 month ago

Release is not yet fully ready, I have released new versions of the dependency packages and try to test this now before officially releasing it. Regarding your questions:

  1. Depending on the prediction model you choose, there are limitations wrt. to the peptides that are supported. This is due to limitations that arise from the model architecture. Prosit for instance, was designed to capture as many peptides as possible while maintaining fast inference and limiting its complexity. This is why it was designed to support peptides with a length of 7-30 aas, as most peptides we see fall into this range. This is why Oktoberfest filters out longer (and shorter) peptides. There are other models that support longer peptides. Below is a list of model specifications:

    prosit
            "min_length": 7,
            "max_length": 30,
            "max_charge": 6,
    ms2pip
            "min_length": 2,
            "max_length": 100,
            "max_charge": 6,
    alphapept
            "min_length": 7,
            "max_length": 35,
            "max_charge": 4,

    While you can use ms2pip, as it supports peptides up to a length of 100, that comes with lower spectral angles between predicted and observed spectra based on our experience, part of the reason is that ms2pip does not support CE calibration. So it is unfortunately always a tradeoff, based on what models are currently available.

  2. You are correct, rescoring is done for DDA data. For DIA data, one would typically create spectral libraries that can then be used with external tools, such as Skyline, DIA-NN, or spectronaut. Also rescoring DIA results is a whole different beast because of chimeric spectra. Rescoring doesn't do anything for you here, since the main idea behind rescoring is using the predicted intensities and this is already there in the predicted spectral libraries. FragPipe also has a DIA mode: https://fragpipe.nesvilab.org/docs/tutorial_DIA.html. The folks at Nesvilab have also integrated online prediction support using our Koina prediction server, so that gives them access to the same models Oktoberfest has access to (I am not sure if that works in DIA mode too though).

ZhaoYaniBruker commented 1 month ago

Dear Mario,

I am very grateful for your quick and nice reply and comments.

Thanks for pointing out the peptide length limits, that is really useful for benchmarking.

Thanks for your answer, I now understand more about rescoring. Regarding to the modifications in rescoring, I think I understand now: it depends on the parameters one sets while using MaxQuant :D.

With best regards.

picciama commented 1 month ago

Regarding variable n-acetylation and unmodified cystein: We never trained models with this, because ProteomeTools, from where the synthetic peptides for training are taken, does not contain such peptides. Therefore, Oktoberfest automatically assumes that cysteins are carbamidomethylated and filters out n-acetylated peptides.

In the timsTOF paper, we used data from Fossati et al. and merely described how they were searching it. Therefore, n-acetylated peptides were filtered out automatically in Oktoberfest. For comparison of Orbitrap versus timsTOF rescoring, we needed a dataset that measured HLA peptides on both instruments, so we chose the Gravel et al. dataset. Because they were not performing carbamidomethylation, we manually filtered out peptides containing cystein prior to rescoring.

Therefore, Oktoberfest always assumes C-carb as a fixed and M-ox as a variable modification and also doesn't support anything else, except TMT mods for the TMT-capable models. However, the new Prosit-PTM model will change that, and so a future Oktoberfest version will also support additional, model dependent modifications, which can then be specified in the configuration by the user.

picciama commented 1 month ago

Thanks for your answer, I now understand more about rescoring. Regarding to the modifications in rescoring, I think I understand now: it depends on the parameters one sets while using MaxQuant :D.

Be careful, the modifications do not depend on what you set in MQ. You can obviously search with many modifications in MQ, but the limiting factor are available prediction models. So Oktoberfest (and any other pipeline) will always filter out everything that is not supported by the model!

ZhaoYaniBruker commented 1 month ago

I have tested oktoberfest rescoring on MQ, works perfect. When I move to Sage, it however gives some errors. The config is: Screenshot 2024-07-17 195037

The error is: Screenshot 2024-07-17 195055

If I switch to .mzML and run sage again, oktoberfest gives different errors: Screenshot 2024-07-17 202835

What would be the cause here? I am looking forward to seeing your answer.

picciama commented 1 month ago

Oktoberfest only supports MQ when rescoring timsTOF data. The challenge is how raw spectra are aggregated to MS2 by the search engine and how the search engine hits can be mapped back to the raw spectra in the d folder, to aggregate them as well to get the MS2 level intensities. We implemented this for MQ, and support for MSFragger is planned. For Sage, we never tested this and I would need to talk to lazaer (Sage maintainer) to see how we can do this.

Oktoberfest should actually stop and tell you that you can only use spectra_type="d" in combination with search_engine="maxquant", but I will need to check again to see why it didn't do that in your case.

I also have the suspicion, that sth. in the Sage output may have changed between versions, which may lead to the first error message you receive. Could you please tell me which version of Sage you were using and ideally also send me the first two rows (header + 1 PSM row)?

The second error most likely arises from the fact that our mzml parser was never intended to be used with timsTOF data. We rely on converting .d to hdf using alphatims, as it allows to read only specific raw spectra for which we actually have hits by the search engine, thus avoiding huge memory explosions when dealing with large datasets, once the d folder was converted to hdf at least once. I don't know how best to support mzml for timsTOF case, because I am not sure exactly how MS2 level spectra would be stored and never looked into it. I also don't know if mzml is generally the preferred way over supplying the d folder or an already converted hdf file.

I hope that my explanations helped you at least understand why not all things work (yet).

ZhaoYaniBruker commented 1 month ago

Dera Mario,

I have Sage version 0.14.4. (PS: I downloaded sage v0.14.7, the newest one, in case the version is the problem).

Few lines of Sage output are attached here: peptide proteins num_proteins filename scannr rank label expmass calcmass charge peptide_len missed_cleavages isotope_error precursor_ppm fragment_ppm hyperscore delta_next delta_best rt aligned_rt predicted_rt delta_rt_model matched_peaks longest_b longest_y longest_y_pct matched_intensity_pct scored_candidates poisson sage_discriminant_score posterior_error spectrum_q peptide_q protein_q ms1_intensity ms2_intensity VVLLGEFLHPC[+57.0216]EDDIVC[+57.0216]K sp|Q9NY12|GAR1_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.d 60746 1 1 2142.0562 2142.049 3 18 0 0.0 3.3052773 8.457472 37.94702352526845 37.94702352526845 0.0 15.076681 0.75383407 0.7563776 0.002543509 13 1 7 0.3888889 37.331467 384 -9.939175662369385 0.9891276 -63.95509 0.000057257374 0.000098721976 0.00031930223 59147.0 74732.0 EAIDSPVSFLALHNQIR sp|Q92900|RENT1_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.d 54418 1 1 1909.02 1909.0059 3 17 0 0.0 7.4175277 4.5374055 29.008742489531706 29.008742489531706 0.0 14.105062 0.70525306 0.7025821 0.0026709437 11 07 0.4117647 16.584265 433 -7.907185630908808 0.9714641 -56.83898 0.000057257374 0.000098721976 0.00031930223 16553.0 11103.0 (base) yani@r2d2:~/public_sage$ head -5 results.sage.tsv peptide proteins num_proteins filename scannr rank label expmass calcmass charge peptide_len missed_cleavages isotope_error precursor_ppm fragment_ppm hyperscore delta_next delta_best rt aligned_rt predicted_rt delta_rt_model matched_peaks longest_b longest_y longest_y_pct matched_intensity_pct scored_candidates poisson sage_discriminant_score posterior_error spectrum_q peptide_q protein_q ms1_intensity ms2_intensity VVLLGEFLHPC[+57.0216]EDDIVC[+57.0216]K sp|Q9NY12|GAR1_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.d 60746 1 1 2142.0562 2142.049 3 18 0 0.0 3.3052773 8.457472 37.94702352526845 37.94702352526845 0.0 15.076681 0.75383407 0.7563776 0.002543509 13 1 7 0.3888889 37.331467 384 -9.939175662369385 0.9891276 -63.95509 0.000057257374 0.000098721976 0.00031930223 59147.0 74732.0 EAIDSPVSFLALHNQIR sp|Q92900|RENT1_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.d 54418 1 1 1909.02 1909.0059 3 17 0 0.0 7.4175277 4.5374055 29.008742489531706 29.008742489531706 0.0 14.105062 0.70525306 0.7025821 0.0026709437 11 07 0.4117647 16.584265 433 -7.907185630908808 0.9714641 -56.83898 0.000057257374 0.000098721976 0.00031930223 16553.0 11103.0 MEEDPDDVPHGHITSLAVK sp|P41227|NAA10_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.d 25936 1 1 2088.989 2088.9785 319 0 0.0 5.025432 7.1123247 32.84235456319327 32.84235456319327 0.0 9.396451 0.46982256 0.47488138 0.005058825 11 17 0.36842105 17.35257 206 -8.171445349046774 0.96620446 -54.80788 0.000057257374 0.000098721976 0.00031930223 37912.0 21542.0 YALQMEQLNGILLHLESELAQTR sp|P05783|K1C18_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.d 75832 1 1 2669.3982 2669.3848 323 0 0.0 5.03026 6.7081027 37.011977401525314 22.424393154556977 0.0 17.393608 0.8696804 0.88561666 0.015936255 13 1 70.3043478 20.705421 176 -10.277421757230503 0.95839936 -51.88696 0.000057257374 0.000098721976 0.00031930223 121570.0 34600.0

picciama commented 1 month ago

Indeed, sth. I have missed. I implemented a fix for this which will appear in the next spectrum_io patch release, which oktoberfest will make use of then. I am close to releasing oktoberfest 0.7.0 and this issue will then be closed.

ZhaoYaniBruker commented 1 month ago

Dear Mario,

Thanks a lot for your reply. Just let you know that there are some new features (for example --annotate-matches tag) in the newest Sage version (I downloaded sage-v0.14.7-x86_64-unknown-linux-gnu.tar.gz). It has two output .tsv table: results.sage.tsv and matched_fragments.sage.tsv. So I give you the output from the new version.

  1. results.sage.tsv psm_id peptide proteins num_proteins filename scannr rank label expmass calcmass charge peptide_len missed_cleavages semi_enzymatic isotope_error precursor_ppm fragment_ppm hyperscore delta_next delta_best rt aligned_rt predicted_rt delta_rt_model ion_mobility predicted_mobility delta_mobility matched_peaks longest_b longest_y longest_y_pct matched_intensity_pct scored_candidates poisson sage_discriminant_score posterior_error spectrum_q peptide_q protein_q ms2_intensity 29273 YLEGTSC[+57.0215]IAGVLVPAK sp|P58107|EPIPL_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.mzML merged=67730 frame=9808 scanStart=324 scanEnd=349 1 1 1676.8995 1676.8807 2 16 0 0 0.0 11.21053 4.413039 55.44661267131702 37.46982356071727 0.0 13.33591 0.6667955 0.68078184 0.013986349 1.1046195 1.1158504 0.011230946 21 4 13 0.8125 24.208618 633 -14.973581538062605 2.017023 -140.70984 0.000048423804 0.000068300775 0.00027366666 56898.0 28725 HHAAYVNNLNVTEEK sp|P04179|SODM_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.mzML merged=26332 frame=5448 scanStart=314 scanEnd=339 11 1737.847 1737.8435 2 15 0 0 0.0 2.0370278 7.1284356 60.485192698584534 36.52151649401785 0.0 7.4260545 0.37130272 0.36828455 0.0030181706 1.1211218 1.1223476 0.0012258291 22 10 12 0.8 11.020701 957 -16.823164371070106 1.9917756-131.36168 0.000048423804 0.000068300775 0.00027366666 199234.0 27191 HQALQAEIAGHEPR sp|Q13813|SPTN1_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.mzML merged=25897 frame=5394 scanStart=316 scanEnd=341 11 1555.7922 1555.7855 2 14 0 0 0.0 4.31541 5.4111695 48.59479862181335 30.385308433787124 0.0 7.3540106 0.36770052 0.3670064 0.0006941259 1.114062 1.0862592 0.027802706 19 6 11 0.78571427 22.307968 969 -14.683957928473985 1.9865242-129.4552 0.000048423804 0.000068300775 0.00027366666 19223.0 39256 DHASIQMNVAEVDK sp|P63220|RS21_HUMAN 1 B_20230927_IO15x75_HeLa_400ng_15min_PASEF_06to15_Slot2-8_1_6487.mzML merged=39317 frame=6904 scanStart=360 scanEnd=385 11 1555.7334 1555.7301 2 14 0 0 0.0 2.118552 7.840434 59.21073138720013 34.622386632051374 0.0 9.401886 0.4700943 0.46926263 0.0008316636 1.0446446 1.0567967 0.012152076 21 10 11 0.78571427 21.433601 593 -16.111041926119043 1.9643503 -121.62653 0.000048423804 0.000068300775 0.00027366666 375617.0

  2. matched_fragments.sage.tsv psm_id fragment_type fragment_ordinals fragment_charge fragment_mz_calculated fragment_mz_experimental fragment_intensity 29273 b 2 1 277.15466 277.15625 3170.0 29273 b 3 1 406.19727 406.1998 2086.0 29273 b 4 1 463.21872 463.22006 757.0 29273 b 5 1 564.2664 564.2696 1088.0

With best regards

picciama commented 1 month ago

This is valueable information! We are currently relying on matching the peaks of the raw spectra to the theoretical monoisotopic masses of each fragment ion to compare predicted against observed peaks, so this could make the spectra annotation much faster, at least for Sage! I will add it to my feature list.