fix: infer library type relationship

balajtimate commented 6 months ago

Description

Add check to assign first_mate and second_mate when it can correctly be inferred from seq_id's (when seq_ids are not empty)
Add check to only align when library source is either inferred or given by --tax-id argument (in case of library type as well as read orientation inference)
Add --outSAMorder PairedKeepInputOrder to mapping.py to keep input order of reads when running STAR with multiple threads
Update tests

In the case of paired-end libraries, the logic should be:

For the library type inference, check seq_ids of first and second pair, to decide if they fit the Casava format, and the lib type and relationship can be determined
If not, as in most SRA samples, align them separately to reference transcripts, but only if library_source is known
Compare alignments and decide relationship
For the orientation inference, based on the library type relationship determined:
1. If not_mates or not_available, infer the orientation from the separately aligned results
2. If split_mates, align the samples in paired-end mode, but only if library_source is known

Fixes #153

Type of change

[X] New feature (non-breaking change which adds functionality)

Checklist

Please carefully read these items and tick them off if the statements are true or do not apply.

[X] I have performed a self-review of my own code
[X] My code follows the existing coding style, lints and generates no new warnings
[X] I have added type annotations to all function/method signatures, and I have added type annotations for any local variables that are non-trivial, potentially ambiguous or might otherwise benefit from explicit typing.
[X] I have commented my code in hard-to-understand areas
[X] I have added ["Google-style docstrings"] to all new modules, classes, methods/functions or updated previously existing ones
[X] I have added tests that prove my fix is effective or that my feature works
[X] New and existing unit tests pass locally with my changes and I have not reduced the code coverage relative to the previous state
[X] I have updated any sections of the app's documentation that are affected by the proposed changes

If for some reason you are unable to tick off all boxes, please leave a comment explaining the issue you are facing so that we can work on it together.

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (930b741) 100.00% compared to head (8fe0cbe) 100.00%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## dev #157 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 13 13 Lines 1109 1131 +22 ========================================= + Hits 1109 1131 +22 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

balajtimate commented 5 months ago

To summarize: refactored _align_mates in get_library_type.py to correctly determine the number of aligned reads as well as the number of concordant reads.

When iterating through the reads, the problem during _compare_alignments was that the order of the reference names were scrambled. Turns out, this was the result of STAR: it outputs the aligned reads in the same order as the input read, except when running it on multiple threads, which I have been doing since day 1. When I reran the samples on the default 1 thread, the output of the alignment matched between the two samples, and the concordant reads were calculated correctly. So one solution was to sort reads by ref name using pysam, which outputs an extra sorted BAM file, but ultimately updating the STAR command with --outSAMorder PairedKeepInputOrder seems to solve this, so there is no extra sorting needed, the output will be in the same order as input (it's also possible to sort the aligned reads with STAR, but only according to coordinates, and I think it'd be better to have sorted according to ref names. Btw, is there any advantage in this case for using SAMs instead of BAMs? I think mapping.py could be updated to output BAMs, maybe in a separate PR).

For checking the ratio between the concordant and the aligned reads (for _update_relationship), I choose the lowest of the two alignments. Also, I'll run it on more samples and check the ratio, but I think we could lower it a bit, like 90% (instead of 95% default now) for the ratio / aligned reads to be considered paired end library.

uniqueg commented 5 months ago

Oh, and please update the PR description to reflect all of the latest changes. As far as I understood, you also fixed some things that never really worked.

zavolanlab / htsinfer