wilhelm-lab / oktoberfest

Rescoring and spectral library generation pipeline for proteomics.
MIT License

ERROR - spectrum_io.file.hdf5::write_dataset value too large to convert to int #199

Open · colin986 opened this issue 5 months ago

colin986 commented 5 months ago

Hi @picciama

I'm using Oktoberfest to generate a Prosit library from a semi-tryptic digest of a UniProt proteome. Here is the output, including the error:

```
2024-02-06 08:43:03,224 - INFO - oktoberfest.utils.config::read Reading configuration from /data/scribe_analysis/spectral_library_config.json
2024-02-06 08:43:51,924 - INFO - spectrum_io.spectral_library.digest::get_peptide_to_protein_map Digesting protein 10000
2024-02-06 08:44:39,633 - INFO - spectrum_io.spectral_library.digest::get_peptide_to_protein_map Digesting protein 20000
2024-02-06 09:10:19,315 - INFO - oktoberfest.preprocessing.preprocessing::process_and_filter_spectra_data No of sequences before filtering is 106425486
2024-02-06 09:21:45,798 - INFO - oktoberfest.preprocessing.preprocessing::process_and_filter_spectra_data No of sequences after filtering is 80469771
2024-02-06 09:43:23,306 - ERROR - spectrum_io.file.hdf5::write_dataset value too large to convert to int
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/spectrum_io/file/hdf5.py", line 123, in write_dataset
    data.to_hdf(path, key=dataset_name, mode=mode, complib=compression)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 2799, in to_hdf
    pytables.to_hdf(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 301, in to_hdf
    f(store)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 283, in <lambda>
    f = lambda store: store.put(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 1123, in put
    self._write_to_group(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 1776, in _write_to_group
    s.write(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 3256, in write
    self.write_array(f"block{i}_values", blk.values, items=blk_items)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 3100, in write_array
    vlarr.append(value)
  File "/usr/local/lib/python3.8/site-packages/tables/vlarray.py", line 528, in append
    self._append(nparr, nobjects)
  File "tables/hdf5extension.pyx", line 2029, in tables.hdf5extension.VLArray._append
OverflowError: value too large to convert to int
2024-02-06 09:43:23,306 - ERROR - spectrum_io.file.hdf5::write_dataset value too large to convert to int
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/spectrum_io/file/hdf5.py", line 123, in write_dataset
    data.to_hdf(path, key=dataset_name, mode=mode, complib=compression)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 2799, in to_hdf
    pytables.to_hdf(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 301, in to_hdf
    f(store)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 283, in <lambda>
    f = lambda store: store.put(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 1123, in put
    self._write_to_group(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 1776, in _write_to_group
    s.write(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 3256, in write
    self.write_array(f"block{i}_values", blk.values, items=blk_items)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/pytables.py", line 3100, in write_array
    vlarr.append(value)
  File "/usr/local/lib/python3.8/site-packages/tables/vlarray.py", line 528, in append
    self._append(nparr, nobjects)
  File "tables/hdf5extension.pyx", line 2029, in tables.hdf5extension.VLArray._append
OverflowError: value too large to convert to int
2024-02-06 09:43:23,899 - INFO - spectrum_io.file.hdf5::write_dataset Data appended to /data/scribe_analysis/nibrt/prosit_library/data/prosit_input_filtered.hdf5
2024-02-06 09:43:24,006 - INFO - spectrum_io.file.hdf5::write_dataset Data appended to /data/scribe_analysis/nibrt/prosit_library/data/prosit_input_filtered.hdf5
Writing library: 1%| | 73/8047 [03:18<6:02:05, 2.72s/it, missing=0, successful=73]
colin@bioinfo-wstation-02:/mnt/HDD2/colin/ribosome_footprint_profiling$
Getting predictions: 1%|▏ | 103/8047 [03:17<4:13:41, 1.92s/it, failed=0, successful=103]
```

This error is not fatal and Oktoberfest ran to completion. I have run a similar spectral library generation before, and one thing I've noticed here is a significant reduction in the file size of the spectral library.

My config file:

```json
{
  "type": "SpectralLibraryGeneration",
  "tag": "",
  "output": "/data/prosit_library",
  "inputs": {
    "library_input": "/data/proteome.fasta",
    "library_input_type": "fasta"
  },
  "models": {
    "intensity": "Prosit_2020_intensity_HCD",
    "irt": "Prosit_2019_irt"
  },
  "spectralLibraryOptions": {
    "fragmentation": "HCD",
    "collisionEnergy": 29,
    "precursorCharge": [2, 3, 4],
    "minIntensity": 5e-4,
    "batchsize": 10000,
    "format": "msp"
  },
  "fastaDigestOptions": {
    "digestion": "semi",
    "missedCleavages": 2,
    "minLength": 7,
    "maxLength": 60,
    "enzyme": "trypsin",
    "specialAas": "KR",
    "db": "target"
  },
  "prediction_server": "koina.proteomicsdb.org:443",
  "numThreads": 30,
  "ssl": true
}
```

Any help appreciated, Colin

picciama commented 5 months ago

Hi @colin986,

This seems to be related to very large dataframes, as in this case with >80 million entries, and happens when Oktoberfest tries to save the digested and filtered peptides to disk.
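For illustration only, a chunked write along the following lines would sidestep the overflow: pandas' fixed-format `to_hdf` serializes object-dtype blocks through a PyTables `VLArray`, which fails once a single block no longer fits into a 32-bit int. This is a sketch under that assumption, not the actual `spectrum_io` code; the function name, chunk size, and `min_itemsize` value are made up.

```python
import pandas as pd


def write_hdf_in_chunks(df: pd.DataFrame, path: str, key: str, chunksize: int = 1_000_000) -> None:
    """Append a large dataframe to an HDF5 file in chunks.

    Using the row-appendable "table" format keeps every individual write well
    below the block size at which the fixed-format VLArray write overflows.
    """
    for start in range(0, len(df), chunksize):
        chunk = df.iloc[start:start + chunksize]
        chunk.to_hdf(
            path,
            key=key,
            mode="a",          # keep writing into the same file
            format="table",    # PyTables table format supports appending rows
            append=start > 0,  # first chunk creates the table, later chunks append
            complib="zlib",
            min_itemsize=60,   # assumed upper bound for string columns; adjust to your data
        )
```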

This step exists purely to save time when rerunning the library generation, for example in case some predictions failed and you want to append them to the existing library file in a second attempt without having to rerun everything.

In your case, the HDF5 file is now presumably corrupt, but if you see no failed batches, you are fine. If some batches did fail, let me know and I will try to find a solution for you so that you don't need to rerun everything.

I will check this in more detail and try to reproduce and fix it.

Concerning your second observation: I understand this as a comparison to runs you did with a previous Oktoberfest version. In that case, the reduction is primarily due to filtering out peaks with a predicted normalized intensity < 1e-5 and rounding values to a fixed number of fraction digits depending on the type of value.
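As a rough illustration of that shrinking step (the column names, threshold, and rounding precisions below are placeholders, not the actual Oktoberfest implementation):

```python
import pandas as pd


def shrink_predicted_spectra(df: pd.DataFrame, min_intensity: float = 1e-5) -> pd.DataFrame:
    """Drop negligible peaks and round values before writing the library.

    The column names ("mz", "intensity", "irt") are hypothetical placeholders.
    """
    kept = df[df["intensity"] >= min_intensity].copy()  # discard peaks below the cutoff
    kept["intensity"] = kept["intensity"].round(4)       # normalized intensities
    kept["mz"] = kept["mz"].round(4)                     # fragment m/z values
    kept["irt"] = kept["irt"].round(2)                   # indexed retention time
    return kept
```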

Just out of curiosity: can you tell me how much disk space is saved for these large libraries and how long the run takes now with version 0.6.0 compared to the previous Oktoberfest version?

colin986 commented 5 months ago

Thanks @picciama

With previous versions, library generation took over 2 days and the MSP file was approximately 250 GB. With 0.6.0, the generation time drops to about 8 hours. At the default minimum intensity (5e-4), the file size shrinks to 80 GB and the run completes in 5 hours.

In terms of failed batches, I haven't seen any since you updated Oktoberfest.