zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0
9 stars 22 forks source link

refactor: avoid duplicate mappings #131

Closed BorisYourich closed 9 months ago

BorisYourich commented 11 months ago

Description

Part of the code from the get_read_orientation.py module was transferred into a new module mapping.py which takes care of the STAR mapping process. The instance of the Mapping class is created in the HtsInfer class and supplemented as an argument to the get_lib_type and get_orientation modules, so as to avoid a repeated mapping if the sequence IDs are not the same and the mates are actually not mates.

Fixes #130

Type of change

Please delete options that are not relevant.

Checklist

Please carefully read these items and tick them off if the statements are true or do not apply.

One Issue which I was not able to replicate is the unit test failures in the CI workflow, I will not be able to work on this currently, so I leave it as is for now.

uniqueg commented 11 months ago

@BorisYourich: I've been away for a while. Should I look at this again or was there nothing added/changed since the last review?

BorisYourich commented 10 months ago

@BorisYourich: I've been away for a while. Should I look at this again or was there nothing added/changed since the last review?

Hey, sorry for such a late response @uniqueg I was busy with other stuff, no update on this yet.

uniqueg commented 10 months ago

No problem. Thanks for the update on no update :) I had just asked because I was away for quite a while myself and lost a bit track of progress on different ends

BorisYourich commented 10 months ago

I made the required changes plus something extra I found, but the unit tests still fail and I have no idea why, and it seems random, take a look at the replayed 3.7 unit test, many more tests failed compared to previous run. I cannot replicate in my local env, even if I downgrade python to 3.7, the tests still pass.

uniqueg commented 10 months ago

Hi @BorisYourich, I've fixed some of the errors by stabilizing the build in various ways (remote now installs the exact same package as local for me): https://github.com/zavolanlab/htsinfer/pull/131/files/0a232822922a006fd97ee9d6e94d44db733eca5d..b5bd469df2694f20ba46178425752b036cda2f94

Among other things, I have included Py3.10 in the supported versions, but dropped support for Py3.7 - support for which is phasing out anyway.

As you can see, there is still one issue remaining that affects two tests. These two tests also pass on my local machine (I suppose the same issue you faced), which -together with the error message- makes me think that it's a permission problem. Maybe a file is attempted to be written in a location that we do have write access on our local machine, but wouldn't have write access on the remote machine. Or a file is created in a temporary directory that is garbage collected on the remote between tests, but not locally.

From the error stack, I would have a look here first: https://github.com/zavolanlab/htsinfer/blob/b5bd469df2694f20ba46178425752b036cda2f94/htsinfer/mapping.py#L79-L142

This method also had a line that created a file name based off an object (which I first thought was the problem). I have now given the transcripts subset file a hardcoded name (not sure if that could cause problems in some case?). And still this stringified library source object instance is used in the log message, so this isn't handled well. Seems to me that this function needs guarding against edge cases, e.g., when there is no library source determined/available).

As I think you know that particular code better from your changes, I would like to pass it back to you if possible. With some debug messages and maybe additional try-except blocks around the code reading and writing that subset transcripts file, I'm sure you will find out what the problem is on the remote.

codecov[bot] commented 10 months ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.19% :tada:

Comparison is base (d243b44) 99.80% compared to head (a5bee43) 100.00%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## dev #131 +/- ## =========================================== + Coverage 99.80% 100.00% +0.19% =========================================== Files 12 13 +1 Lines 1047 1078 +31 =========================================== + Hits 1045 1078 +33 + Misses 2 0 -2 ``` | [Files Changed](https://app.codecov.io/gh/zavolanlab/htsinfer/pull/131?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zavolanlab) | Coverage Δ | | |---|---|---| | [htsinfer/cli.py](https://app.codecov.io/gh/zavolanlab/htsinfer/pull/131?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zavolanlab#diff-aHRzaW5mZXIvY2xpLnB5) | `100.00% <100.00%> (ø)` | | | [htsinfer/get\_library\_type.py](https://app.codecov.io/gh/zavolanlab/htsinfer/pull/131?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zavolanlab#diff-aHRzaW5mZXIvZ2V0X2xpYnJhcnlfdHlwZS5weQ==) | `100.00% <100.00%> (ø)` | | | [htsinfer/get\_read\_orientation.py](https://app.codecov.io/gh/zavolanlab/htsinfer/pull/131?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zavolanlab#diff-aHRzaW5mZXIvZ2V0X3JlYWRfb3JpZW50YXRpb24ucHk=) | `100.00% <100.00%> (+0.86%)` | :arrow_up: | | [htsinfer/htsinfer.py](https://app.codecov.io/gh/zavolanlab/htsinfer/pull/131?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zavolanlab#diff-aHRzaW5mZXIvaHRzaW5mZXIucHk=) | `100.00% <100.00%> (ø)` | | | [htsinfer/mapping.py](https://app.codecov.io/gh/zavolanlab/htsinfer/pull/131?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zavolanlab#diff-aHRzaW5mZXIvbWFwcGluZy5weQ==) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

BorisYourich commented 10 months ago

Playing around with the tests I realized where the problem is, the tests didn't have the tmpdir argument. Everything works now :)

uniqueg commented 10 months ago

Awesome, thanks a lot, @BorisYourich!

@balajtimate: Can you please test the feature branch to see if the proposed changes break anything on real world data? Maybe you can run it on a few samples and see if you get the same results as you did previously.

If all is good, I will merge.

balajtimate commented 9 months ago

Okay, so I tested on a subset of 20 samples from our sample data, random organisms, SE/PE and there was no error, code ran perfectly, and the results match the ones from the run with the dev branch. 😄

I think you can merge now, if the 3.7 integration and unit tests run.

uniqueg commented 9 months ago

Thanks! Merging this now, given that it doesn't break anything. We can still do in-depth testing later to see if it actually works as expected. Thanks a lot @BorisYourich