wfondrie / mokapot

Fast and flexible semi-supervised learning for peptide detection in Python
https://mokapot.readthedocs.io
Apache License 2.0

Multi-threaded writing of output files #107

Open gessulat opened 1 year ago

gessulat commented 1 year ago

For very large datasets, single-threaded I/O is currently a speed bottleneck. Pyarrow datasets natively support:


gessulat commented 1 year ago

This issue is related to the upgrade to polars https://github.com/wfondrie/mokapot/issues/89 Optimizing reading and writing could be done independently, though.

gessulat commented 1 year ago

To motivate this, @sambenfredj and I ran some benchmarks on a 3M-PSM mokapot input file (tab-separated/CSV), converting it to parquet with different reader and writer implementations. Note that there are several ways to read and write parquet files: different compression algorithms, and different implementations (pandas, pyarrow, and polars). Within pyarrow there are again multiple options for reading and writing. That's why the read_speed plot is confusing; I didn't have the time to clean it up, sorry!

Speed is always in seconds.

TL;DR:

file sizes
[image: file_sizes]

read speed
[image: read_speed]

write speed
[image: write_speed]