Closed balajtimate closed 5 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (
930b741
) 100.00% compared to head (8fe0cbe
) 100.00%.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
To summarize: refactored _align_mates
in get_library_type.py
to correctly determine the number of aligned reads as well as the number of concordant reads.
When iterating through the reads, the problem during _compare_alignments
was that the order of the reference names were scrambled. Turns out, this was the result of STAR: it outputs the aligned reads in the same order as the input read, except when running it on multiple threads, which I have been doing since day 1. When I reran the samples on the default 1 thread, the output of the alignment matched between the two samples, and the concordant reads were calculated correctly. So one solution was to sort reads by ref name using pysam
, which outputs an extra sorted BAM file, but ultimately updating the STAR command with --outSAMorder PairedKeepInputOrder
seems to solve this, so there is no extra sorting needed, the output will be in the same order as input (it's also possible to sort the aligned reads with STAR, but only according to coordinates, and I think it'd be better to have sorted according to ref names. Btw, is there any advantage in this case for using SAMs instead of BAMs? I think mapping.py
could be updated to output BAMs, maybe in a separate PR).
For checking the ratio between the concordant and the aligned reads (for _update_relationship
), I choose the lowest of the two alignments. Also, I'll run it on more samples and check the ratio, but I think we could lower it a bit, like 90% (instead of 95% default now) for the ratio / aligned reads to be considered paired end library.
Oh, and please update the PR description to reflect all of the latest changes. As far as I understood, you also fixed some things that never really worked.
Description
first_mate
andsecond_mate
when it can correctly be inferred from seq_id's (when seq_ids are not empty)library source
is either inferred or given by--tax-id
argument (in case oflibrary type
as well asread orientation
inference)--outSAMorder PairedKeepInputOrder
tomapping.py
to keep input order of reads when running STAR with multiple threadsIn the case of paired-end libraries, the logic should be:
seq_ids
of first and second pair, to decide if they fit the Casava format, and the lib type and relationship can be determinedlibrary_source
is knownnot_mates
ornot_available
, infer the orientation from the separately aligned resultssplit_mates
, align the samples in paired-end mode, but only iflibrary_source
is knownFixes #153
Type of change
Checklist
Please carefully read these items and tick them off if the statements are true or do not apply.
If for some reason you are unable to tick off all boxes, please leave a comment explaining the issue you are facing so that we can work on it together.