zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0
9 stars 22 forks source link

fix: infer library type relationship #157

Closed balajtimate closed 5 months ago

balajtimate commented 6 months ago

Description

In the case of paired-end libraries, the logic should be:

  1. For the library type inference, check seq_ids of first and second pair, to decide if they fit the Casava format, and the lib type and relationship can be determined
  2. If not, as in most SRA samples, align them separately to reference transcripts, but only if library_source is known
  3. Compare alignments and decide relationship
  4. For the orientation inference, based on the library type relationship determined:
    1. If not_mates or not_available, infer the orientation from the separately aligned results
    2. If split_mates, align the samples in paired-end mode, but only if library_source is known

Fixes #153

Type of change

Checklist

Please carefully read these items and tick them off if the statements are true or do not apply.

If for some reason you are unable to tick off all boxes, please leave a comment explaining the issue you are facing so that we can work on it together.

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (930b741) 100.00% compared to head (8fe0cbe) 100.00%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## dev #157 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 13 13 Lines 1109 1131 +22 ========================================= + Hits 1109 1131 +22 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

balajtimate commented 5 months ago

To summarize: refactored _align_mates in get_library_type.py to correctly determine the number of aligned reads as well as the number of concordant reads.

When iterating through the reads, the problem during _compare_alignments was that the order of the reference names were scrambled. Turns out, this was the result of STAR: it outputs the aligned reads in the same order as the input read, except when running it on multiple threads, which I have been doing since day 1. When I reran the samples on the default 1 thread, the output of the alignment matched between the two samples, and the concordant reads were calculated correctly. So one solution was to sort reads by ref name using pysam, which outputs an extra sorted BAM file, but ultimately updating the STAR command with --outSAMorder PairedKeepInputOrder seems to solve this, so there is no extra sorting needed, the output will be in the same order as input (it's also possible to sort the aligned reads with STAR, but only according to coordinates, and I think it'd be better to have sorted according to ref names. Btw, is there any advantage in this case for using SAMs instead of BAMs? I think mapping.py could be updated to output BAMs, maybe in a separate PR).

For checking the ratio between the concordant and the aligned reads (for _update_relationship), I choose the lowest of the two alignments. Also, I'll run it on more samples and check the ratio, but I think we could lower it a bit, like 90% (instead of 95% default now) for the ratio / aligned reads to be considered paired end library.

uniqueg commented 5 months ago

Oh, and please update the PR description to reflect all of the latest changes. As far as I understood, you also fixed some things that never really worked.