Add parquet and sqlite support; add NNLS-based PEP and Q-value calculation

jspaezp commented 3 months ago

I had a question on commit 466838cc07095522cc7047a571ea71a82f597477 ... in my system tests/system_tests/test_cli.py::test_cli_aggregate (https://github.com/msaid-de/mokapot/blob/c021f0906121c11b457f42e35d9979a7aff53f5c/tests/system_tests/test_cli.py#L136) fails with assert count_lines(Path(tmp_path, "blah.targets.psms")) == 10256 where the diff is 10298 == 10256. could we specify this as a range, rather than a hard-coded number (since even if the seed is the same, changes in the system/compiler can make numeric outputs unstable)

jspaezp commented 3 months ago

Just noticed that a lot of the docstrings in your functions are numpy style, whilst the project mostly uses sphinx style ... @wfondrie should we migrate to numpy style? (and probably pick what rules are enforced in ruff -> https://docs.astral.sh/ruff/rules/#pydocstyle-d )

ezander commented 3 months ago

I had a question on commit 466838c ... in my system tests/system_tests/test_cli.py::test_cli_aggregate (https://github.com/msaid-de/mokapot/blob/c021f0906121c11b457f42e35d9979a7aff53f5c/tests/system_tests/test_cli.py#L136) fails with assert count_lines(Path(tmp_path, "blah.targets.psms")) == 10256 where the diff is 10298 == 10256. could we specify this as a range, rather than a hard-coded number (since even if the seed is the same, changes in the system/compiler can make numeric outputs unstable)

How much would you think those numbers can vary? I think some quite broad range (e.g. like 9000 <= line_count <= 11000) would probably suffice here, since we're only testing for plausibility anyway. Maybe even just some lower limit would do, don't you think?

ezander commented 3 months ago

Just noticed that a lot of the docstrings in your functions are numpy style, whilst the project mostly uses sphinx style ... @wfondrie should we migrate to numpy style? (and probably pick what rules are enforced in ruff -> https://docs.astral.sh/ruff/rules/#pydocstyle-d )

I think, the original mokapot code had already numpy style docstrings instead of sphynx style. I didn't notice that when writing new functions and just used sphynx style out of habit, but I think too, that numpy is much better readable when you look directly into the code instead of generated documentation.

jspaezp commented 3 months ago

Hi @ezander !

I submitted a PR to the branch of this PR over here: https://github.com/msaid-de/mokapot/pull/42 ... its mostly chores but it handles lil things like the hard-coded number.

LMK what you think!

wfondrie commented 3 months ago

Just noticed that a lot of the docstrings in your functions are numpy style, whilst the project mostly uses sphinx style ... @wfondrie should we migrate to numpy style? (and probably pick what rules are enforced in ruff -> https://docs.astral.sh/ruff/rules/#pydocstyle-d )

I think, the original mokapot code had already numpy style docstrings instead of sphynx style. I didn't notice that when writing new functions and just used sphynx style out of habit, but I think too, that numpy is much better readable when you look directly into the code instead of generated documentation.

Perfect - I was going to request that all the docstrings be converted to Numpy-style 😄.

wfondrie commented 3 months ago

It looks like we're adding a lot of files to the test directory. At the very least, I think we need to document what they are, say by adding a README within the data directory.

The better alternative would be to move toward all of these being programmatically generated (ie simulated results), when possible. Alas, adding a bunch of real files was one of the mistakes I made early on.

jspaezp commented 2 months ago

Post Megamerge wishlist (capture as checklists the future issues detected while discussing this PR):

[ ] Brew refactor https://github.com/wfondrie/mokapot/pull/119/files#r1646576640
- [ ] Break function up
- [ ] Remove control flow based on assertions
[ ] Re-parametrize chunk size https://github.com/wfondrie/mokapot/pull/119/files#r1646580990
[ ] Strict zip pre-3.10 https://github.com/wfondrie/mokapot/pull/119/files#r1646583534
[ ] Scaler weight loading https://github.com/wfondrie/mokapot/pull/119/files#r1646591660

gessulat commented 2 months ago

Post Megamerge wishlist (capture as checklists the future issues detected while discussing this PR):

[x] Brew refactor https://github.com/wfondrie/mokapot/pull/119/files#r1646576640 Issue: https://github.com/wfondrie/mokapot/issues/110

[x] Re-parametrize chunk size https://github.com/wfondrie/mokapot/pull/119/files#r1646580990 Issue https://github.com/wfondrie/mokapot/issues/123

[x] Strict zip pre-3.10 https://github.com/wfondrie/mokapot/pull/119/files#r1646583534 - will be solved through upgrade to Python 3.10 as discussed with @wfondrie

[x] Scaler weight loading https://github.com/wfondrie/mokapot/pull/119/files#r1646591660 Issue: https://github.com/wfondrie/mokapot/issues/124

[x] remove MSAID's sqlite support https://github.com/wfondrie/mokapot/pull/119#discussion_r1646586660 Issue: https://github.com/wfondrie/mokapot/issues/120
[x] input and output data frame specifications + separate converters for non-conforming input formats https://github.com/wfondrie/mokapot/pull/119#discussion_r1659262096 Issues: https://github.com/wfondrie/mokapot/issues/109 https://github.com/wfondrie/mokapot/issues/108
[x] consolidate naming of LinearPsmDataset and OnDiskPsmDataset classes. Issue: https://github.com/wfondrie/mokapot/issues/122
[x] revert changing suffixes to 'psms' to 'psms.txt' (and other levels as well) https://github.com/wfondrie/mokapot/pull/119#discussion_r1681072180 Issue: https://github.com/wfondrie/mokapot/issues/121

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 86.35671% with 179 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@af44229). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
mokapot/tabular_data.py	84.52%	54 Missing :warning:
mokapot/parsers/pin_to_tsv.py	39.34%	37 Missing :warning:
mokapot/streaming.py	87.50%	20 Missing :warning:
mokapot/mokapot.py	65.85%	14 Missing :warning:
mokapot/brew_rollup.py	92.69%	13 Missing :warning:
mokapot/peps.py	71.11%	13 Missing :warning:
mokapot/dataset.py	85.96%	8 Missing :warning:
mokapot/confidence.py	95.85%	7 Missing :warning:
mokapot/confidence_writer.py	88.67%	6 Missing :warning:
mokapot/brew.py	84.61%	2 Missing :warning:
... and 5 more

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #119 +/- ## ======================================= Coverage ? 84.60% ======================================= Files ? 26 Lines ? 2949 Branches ? 0 ======================================= Hits ? 2495 Misses ? 454 Partials ? 0 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

wfondrie / mokapot

Add parquet and sqlite support; add NNLS-based PEP and Q-value calculation #119

Codecov Report