wfondrie / mokapot

Fast and flexible semi-supervised learning for peptide detection in Python
https://mokapot.readthedocs.io
Apache License 2.0

File schema via pydantic #109

Open gessulat opened 9 months ago

gessulat commented 9 months ago

The PsmSchema definition is currently implemented via dataclasses. It's great to have the ability to validate a dataframe with a schema!

Pydantic is a library for defining schemas and validation that might offer additional useful functionality. It might be overkill for the use case of Mokapot but I think it's worth evaluating. I found these two articles showcasing how it could be done.
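To make the idea a bit more concrete, here is a minimal sketch of what a pydantic-based column schema could look like. It assumes pydantic v2, and the class name `PsmColumns`, its fields, and the helper `missing_from` are hypothetical illustrations, not mokapot's actual `PsmSchema`.

```python
from pydantic import BaseModel, ValidationError, model_validator


class PsmColumns(BaseModel):
    """Hypothetical schema: which column plays which role in a PSM table."""

    target: str          # column holding the target/decoy label
    peptide: str         # column holding the peptide sequence
    spectrum: list[str]  # columns that together identify a spectrum
    features: list[str]  # columns used as model features

    def all_columns(self) -> list[str]:
        return [self.target, self.peptide, *self.spectrum, *self.features]

    @model_validator(mode="after")
    def roles_do_not_overlap(self):
        # A column should not be assigned to two roles at once.
        cols = self.all_columns()
        repeated = sorted({c for c in cols if cols.count(c) > 1})
        if repeated:
            raise ValueError(f"columns used in more than one role: {repeated}")
        return self

    def missing_from(self, available) -> list[str]:
        """Return the schema columns that are absent from `available`."""
        present = set(available)
        return [c for c in self.all_columns() if c not in present]


schema = PsmColumns(
    target="label",
    peptide="peptide",
    spectrum=["scan"],
    features=["score", "ppm"],
)
print(schema.missing_from(["label", "peptide", "scan", "score"]))  # ['ppm']

try:
    PsmColumns(target="label", peptide="label", spectrum=["scan"], features=[])
except ValidationError as err:
    print(err.errors()[0]["msg"])
```

With a dataframe, `available` would simply be the dataframe's column names. Type checks on the fields come for free from the annotations, and cross-field rules live in the model validator.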

In case this is useful: Tasks

jspaezp commented 8 months ago

[Full disclosure, I love pydantic and use it for json validation all the time]

In principle I really like this idea! That said, I am not sure exactly where and how much validation would need to happen within mokapot in a way that would require extensibility via pydantic. Would you mind elaborating on the use case/API you have in mind for it?

Just FYI, for dataframe validation I have been using this project https://docs.dagster.io/integrations/pandas, and I really like the syntax they use for validation. (https://github.com/unionai-oss/pandera and https://github.com/JakobGM/patito are alternatives I have evaluated as well.)

gessulat commented 8 months ago

Sorry that the context was missing! This idea came up in a discussion with @wfondrie. Internally, we use schemas and validators a lot for various things, mostly for API definitions and complex configuration files (e.g. validating Sage configs).

I just noticed that validation of what defines a PsmDataset is currently implemented via dataclasses. Pydantic would be one option for generalized, schema-based validation that might also be useful for validating other things. One could imagine, for example, that instead of specifying flags on the command line (which can be cumbersome with a large set of flags that have dependencies and interactions), parameters could be specified in a configuration file, since configuration files offer more flexibility to express parameter dependencies. In that case Pydantic could be used in a similar way for both: validating internal data structures and exposed APIs.
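As a rough sketch of the configuration-file idea, assuming pydantic v2 and a JSON config: the model name `RunConfig`, every field in it, and the dependency rule are hypothetical and only meant to show how a dependency between parameters could be expressed.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class RunConfig(BaseModel):
    """Hypothetical run configuration; the field names are illustrative only."""

    train_fdr: float = Field(0.01, gt=0, le=1)  # must lie in (0, 1]
    test_fdr: float = Field(0.01, gt=0, le=1)
    folds: int = Field(3, ge=2)                 # cross-validation needs >= 2 folds
    save_results: bool = False
    output_dir: Optional[str] = None

    @model_validator(mode="after")
    def check_dependencies(self):
        # Example of a parameter dependency: one option requires another.
        if self.save_results and self.output_dir is None:
            raise ValueError("output_dir is required when save_results is true")
        return self


def load_config(text: str) -> RunConfig:
    """Parse and validate a JSON configuration string."""
    return RunConfig.model_validate_json(text)


config = load_config('{"folds": 5, "save_results": true, "output_dir": "out"}')
print(config.folds, config.train_fdr)  # 5 0.01

try:
    load_config('{"save_results": true}')
except ValidationError as err:
    print(err.errors()[0]["msg"])
```

The same kind of model could then describe both the config file and internal data structures, so range checks and cross-parameter rules sit in one place instead of being spread across CLI parsing code.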

Dagster's type definitions also look good to me, but I only skimmed the documentation. I assume they are specific to dataframes and don't generalize to more generic data structures, correct? If you intend to use dagster as a dependency in Mokapot anyway, this could be a great fit, but if it's only for validation, it feels like an out-of-place dependency to me.

Both dagster and pydantic seem to be good choices to me. It basically depends on whether a) pydantic schemas might be valuable in other places in the future, or b) other dagster functionality would be valuable in the future. You definitely have a better feeling for that ;)

jspaezp commented 8 months ago

Thanks for the context!

Just for the record, the dagster reference was more about the interface than the actual implementing package. I would love to go into more detail on the implementation once we get the "mega-merge" done on the current development version of the project!

Best!