zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0

feat: sort alignments by read name #159

Open balajtimate opened 5 months ago

balajtimate commented 5 months ago

Is your feature request related to a problem? Please describe.

The SAM files that STAR writes in mapping.py keep the aligned reads in the same order as the input FASTQ files. This can be a problem for the library_type inference: when two samples are aligned separately and the inputs are ordered differently (for example, one sorted and one unsorted), the read order of the two outputs differs and the alignments cannot be compared directly. Currently, we assume that the inputs are sorted either by read name or by coordinates, but it would be more robust to sort the STAR output ourselves.

Describe the solution you'd like

Sort the aligned reads by read name. This could be done either in mapping.py, right after the alignment step, or in get_library_type.py, since the sorted, separately aligned files are only needed there, for the comparison that counts concordant pairs. Use pysam.sort with the -n argument to create name-sorted BAM files.
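
A minimal sketch of what the sorting step could look like, assuming pysam is installed. The function names and file paths are hypothetical and not part of the current HTSinfer code. pysam.sort forwards its arguments to samtools sort, so -n requests sorting by read name and -o names the output file.

```python
from pathlib import Path
from typing import List


def name_sort_args(unsorted_sam: Path, sorted_bam: Path) -> List[str]:
    """Build arguments equivalent to `samtools sort -n -o <out> <in>`.

    Hypothetical helper, kept separate so the arguments can be checked
    without running the sort itself.
    """
    return ["-n", "-o", str(sorted_bam), str(unsorted_sam)]


def sort_by_read_name(unsorted_sam: Path, sorted_bam: Path) -> Path:
    """Name-sort one STAR output file (hypothetical helper).

    After this step, separately aligned files share the same read order
    and can be compared read by read.
    """
    import pysam  # imported lazily; assumes pysam is available

    pysam.sort(*name_sort_args(unsorted_sam, sorted_bam))
    return sorted_bam
```

Calling sort_by_read_name once per STAR output (for example, once per mate file) would give name-sorted files that could then be compared in get_library_type.py. Whether the call belongs there or in mapping.py is the open question above.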