wfondrie / mokapot

Fast and flexible semi-supervised learning for peptide detection in Python
https://mokapot.readthedocs.io
Apache License 2.0
41 stars 15 forks source link

Add parquet and sqlite support; add NNLS-based PEP and Q-value calculation #119

Closed gessulat closed 3 weeks ago

jspaezp commented 3 months ago

I had a question on commit 466838cc07095522cc7047a571ea71a82f597477 ... in my system tests/system_tests/test_cli.py::test_cli_aggregate (https://github.com/msaid-de/mokapot/blob/c021f0906121c11b457f42e35d9979a7aff53f5c/tests/system_tests/test_cli.py#L136) fails with assert count_lines(Path(tmp_path, "blah.targets.psms")) == 10256 where the diff is 10298 == 10256. could we specify this as a range, rather than a hard-coded number (since even if the seed is the same, changes in the system/compiler can make numeric outputs unstable)

jspaezp commented 3 months ago

Just noticed that a lot of the docstrings in your functions are numpy style, whilst the project mostly uses sphinx style ... @wfondrie should we migrate to numpy style? (and probably pick what rules are enforced in ruff -> https://docs.astral.sh/ruff/rules/#pydocstyle-d )

ezander commented 3 months ago

I had a question on commit 466838c ... in my system tests/system_tests/test_cli.py::test_cli_aggregate (https://github.com/msaid-de/mokapot/blob/c021f0906121c11b457f42e35d9979a7aff53f5c/tests/system_tests/test_cli.py#L136) fails with assert count_lines(Path(tmp_path, "blah.targets.psms")) == 10256 where the diff is 10298 == 10256. could we specify this as a range, rather than a hard-coded number (since even if the seed is the same, changes in the system/compiler can make numeric outputs unstable)

How much would you think those numbers can vary? I think some quite broad range (e.g. like 9000 <= line_count <= 11000) would probably suffice here, since we're only testing for plausibility anyway. Maybe even just some lower limit would do, don't you think?

ezander commented 3 months ago

Just noticed that a lot of the docstrings in your functions are numpy style, whilst the project mostly uses sphinx style ... @wfondrie should we migrate to numpy style? (and probably pick what rules are enforced in ruff -> https://docs.astral.sh/ruff/rules/#pydocstyle-d )

I think, the original mokapot code had already numpy style docstrings instead of sphynx style. I didn't notice that when writing new functions and just used sphynx style out of habit, but I think too, that numpy is much better readable when you look directly into the code instead of generated documentation.

jspaezp commented 3 months ago

Hi @ezander !

I submitted a PR to the branch of this PR over here: https://github.com/msaid-de/mokapot/pull/42 ... its mostly chores but it handles lil things like the hard-coded number.

LMK what you think!

wfondrie commented 3 months ago

Just noticed that a lot of the docstrings in your functions are numpy style, whilst the project mostly uses sphinx style ... @wfondrie should we migrate to numpy style? (and probably pick what rules are enforced in ruff -> https://docs.astral.sh/ruff/rules/#pydocstyle-d )

I think, the original mokapot code had already numpy style docstrings instead of sphynx style. I didn't notice that when writing new functions and just used sphynx style out of habit, but I think too, that numpy is much better readable when you look directly into the code instead of generated documentation.

Perfect - I was going to request that all the docstrings be converted to Numpy-style 😄.

wfondrie commented 3 months ago

It looks like we're adding a lot of files to the test directory. At the very least, I think we need to document what they are, say by adding a README within the data directory.

The better alternative would be to move toward all of these being programmatically generated (ie simulated results), when possible. Alas, adding a bunch of real files was one of the mistakes I made early on.

jspaezp commented 2 months ago

Post Megamerge wishlist (capture as checklists the future issues detected while discussing this PR):

gessulat commented 2 months ago

Post Megamerge wishlist (capture as checklists the future issues detected while discussing this PR):

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 86.35671% with 179 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@af44229). Learn more about missing BASE report.

Files with missing lines Patch % Lines
mokapot/tabular_data.py 84.52% 54 Missing :warning:
mokapot/parsers/pin_to_tsv.py 39.34% 37 Missing :warning:
mokapot/streaming.py 87.50% 20 Missing :warning:
mokapot/mokapot.py 65.85% 14 Missing :warning:
mokapot/brew_rollup.py 92.69% 13 Missing :warning:
mokapot/peps.py 71.11% 13 Missing :warning:
mokapot/dataset.py 85.96% 8 Missing :warning:
mokapot/confidence.py 95.85% 7 Missing :warning:
mokapot/confidence_writer.py 88.67% 6 Missing :warning:
mokapot/brew.py 84.61% 2 Missing :warning:
... and 5 more
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #119 +/- ## ======================================= Coverage ? 84.60% ======================================= Files ? 26 Lines ? 2949 Branches ? 0 ======================================= Hits ? 2495 Misses ? 454 Partials ? 0 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.