zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0

Compare inferred read length statistics to SRA metadata #151

Closed: balajtimate closed this 7 months ago

balajtimate commented 7 months ago

I've been looking into the read length data reported for the SRA sample dataset. I reran the samples that had errors and modified the testing script so that the inferred length always gets reported, with no Undecided results left (please compare with the results in the performance issue).

With this, there were 563 matches and 215 mismatches. I then found that the read statistics for the second mate in paired samples weren't reported correctly in the results, so I fixed that (see commit). I also added the additional read statistics (mean, median, mode) to the results and compared the SRA-reported read length against them, after which most read length results (745 out of 779) matched.
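For illustration only, here is a minimal sketch of that kind of comparison. It is not the actual testing script: the function names, the `tolerance` parameter, and the example read lengths are all hypothetical, and it assumes a reported length counts as a match if it agrees with any of the three statistics.

```python
from statistics import mean, median, mode


def read_length_stats(lengths):
    """Return mean, median and mode for a non-empty list of read lengths."""
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "mode": mode(lengths),
    }


def matches_reported(reported, stats, tolerance=0.5):
    """Return True if the reported length agrees with any statistic.

    `tolerance` is a hypothetical allowance for rounding of the mean.
    """
    return any(abs(reported - value) <= tolerance for value in stats.values())


# Made-up read lengths, not taken from any SRA sample.
lengths = [150, 150, 150, 120, 120]
stats = read_length_stats(lengths)
print(stats)
print(matches_reported(138, stats))  # compared against mean, median and mode
```

Checking against all three statistics matters because a reported value may correspond to the mean for trimmed or variable-length reads, whereas the median or mode would show the nominal read length.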

I checked the remaining samples individually on SRA (33 mismatches + 1 sample with no length metadata). In all of those cases, the length in our sample data was the mean read length, and the difference was always ±1. For example, for SRR18675578 SRA reports a mean length of 138, our sample data table had 137 bp, and the inferred mean is 138.03, so I corrected that.

I would say the read length inference now works as intended: with the additional statistics reported, the results match the metadata from SRA.

uniqueg commented 7 months ago

That's fantastic work, thanks a lot @balajtimate :pray: